sql4ml: A Declarative End-to-End Workflow for Machine Learning
📝 Original Paper Info
- Title: sql4ml: A declarative end-to-end workflow for machine learning
- ArXiv ID: 1907.12415
- Date: 2019-08-05
- Authors: Nantia Makrynioti, Ruy Ley-Wild, Vasilis Vassalos
📝 Abstract
We present sql4ml, a system for expressing supervised machine learning (ML) models in SQL and automatically training them in TensorFlow. The primary motivation for this work stems from the observation that in many data science tasks there is a back-and-forth between a relational database that stores the data and a machine learning framework. Data preprocessing and feature engineering typically happen in a database, whereas learning is usually executed in separate ML libraries. This fragmented workflow requires users to juggle between different programming paradigms and software systems. With sql4ml the user can express both feature engineering and ML algorithms in SQL, while the system translates this code to an appropriate representation for training inside a machine learning framework. We describe our translation method, present experimental results from applying it on three well-known ML algorithms and discuss the usability benefits from concentrating the entire workflow on the database side.
💡 Summary & Analysis
This paper introduces sql4ml, a system designed to integrate relational databases and machine learning (ML) frameworks by allowing users to express both feature engineering and ML algorithms in SQL. The primary motivation is to simplify the interaction between these two domains often seen in data science tasks. Traditionally, data preprocessing and feature engineering happen within a database, while learning occurs in separate libraries like TensorFlow or PyTorch. This fragmentation requires users to juggle different programming paradigms and software systems, adding complexity and overhead. The key contribution of sql4ml is its ability to translate SQL queries into ML framework code (specifically TensorFlow). Users can define their ML models using familiar SQL syntax, which sql4ml then translates into TensorFlow operations. This translation process includes handling linear algebra operations, mathematical functions like exponentials and logarithms, and even complex optimization algorithms such as gradient descent.
The paper presents experimental results demonstrating the effectiveness of this approach on popular ML algorithms including Linear Regression, Logistic Regression, and Factorization Machines. The results highlight how sql4ml can streamline data science workflows by unifying the process within a single SQL-based environment.
This system is significant because it reduces complexity in developing ML models, making it easier for users to manage their entire workflow from feature engineering through model training without switching between different programming environments or manual intervention. It allows for more efficient experimentation and iteration cycles in the development of data-driven solutions.
📄 Full Paper Content (ArXiv Source)
In this section we briefly discuss some fundamental differences between relational databases and ML frameworks, which pose challenges in integrating functionality of the two worlds. Some of the differences stem from the foundations of relational algebra and linear algebra, whereas others are related to programming paradigms and the use of declarative versus procedural languages.
Relational databases are built around the abstraction of a relation, which is an unordered set of tuples. There is no notion of order between the columns of a tuple. Because of this, the user is not concerned with the particular physical representation of relations in the database, and the system is free to optimize the representation (e.g., with indexes). On the other hand, ML frameworks operate on tensors, which are multidimensional arrays. The word “array” is key here. It indicates that the elements of a tensor are accessed via indices, basically natural numbers from $`0`$ to $`I-1`$ (or $`1..I`$), which in turn implies a specific order on the elements. The connection between boolean matrices and relations is well-established, whereas recent work studies mappings between matrices of numeric values and $`K`$-relations.
On the functionality front, relational operators map input relations to an output relation and include selection, projection, joins and set-theoretic operators. In contrast, linear algebra operators include matrix multiplication, summations over rows and columns, matrix transpose, element-wise arithmetic operations and others. In addition, as far as the programming model is concerned, databases support declarative languages, traditionally SQL and in some cases Datalog, whereas ML frameworks usually prefer imperative languages, such as Python and C++. It has been shown that a set of commonly used linear algebra operators can be implemented in SQL using element-wise mathematical operations and aggregations. For example, a matrix-vector multiplication is an element-wise multiplication between columns and a summation over the elements of each tuple in a relation. However, other parts of machine learning algorithms are more difficult to implement inside a database.
A significant limitation of SQL is the lack of iteration constructs. RDBMS support for iterations is typically limited to fixed-points over sets. Iterative processes are very common in machine learning though. The training of a ML model is one of them and is done using well-known mathematical optimization algorithms, such as gradient descent. Because of their wide use in machine learning, ML frameworks provide highly optimized implementations of such algorithms.
Another very useful capability of some ML frameworks, which is not currently supported in databases, is automatic differentiation. Systems that do not provide automatic differentiation require the user to explicitly write the derivatives of the objective function, which becomes nontrivial for complicated functions.
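To make the benefit concrete, here is a minimal sketch, assuming the TensorFlow 1.x-style API that the paper's later listings use and a tiny made-up dataset: the framework derives the gradient of a logistic loss with respect to the weights, so the user never writes the derivative by hand.

```python
import tensorflow as tf  # TensorFlow 1.x-style API, matching the paper's listings

# Made-up toy data: 3 observations, 2 features (for illustration only).
features = tf.constant([[1.0, 2.0], [0.5, 1.5], [2.0, 0.1]], dtype=tf.float64)
targets = tf.constant([1.0, 0.0, 1.0], dtype=tf.float64)
weights = tf.Variable([0.0, 0.0], dtype=tf.float64)

# Logistic loss expressed only in terms of features and weights.
product = tf.tensordot(features, weights, axes=1)
sigmoid = 1.0 / (1.0 + tf.exp(-product))
loss = -tf.reduce_sum(targets * tf.log(sigmoid)
                      + (1.0 - targets) * tf.log(1.0 - sigmoid))

# Automatic differentiation: the framework builds d(loss)/d(weights) for us.
grad = tf.gradients(loss, [weights])[0]

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    print(session.run(grad))
```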
On the other hand, ML systems lack support when it comes to data processing. Unlike databases, a tensor library like TensorFlow does not provide relational operators (e.g., joins and filters). While a ML framework can be combined with other libraries in the host language that provide tabular operations, such as Python Pandas, there are two main issues with this approach. First, the scalability of these solutions is limited by the main memory of the machine. Second, such libraries offer a rather procedural way of writing relational operations that exposes the order in which they are executed, e.g. A.join(B).join(C). Databases, in contrast, incorporate an extensive body of work on optimizing queries that merely describe the output set, and they handle large datasets.
Introduction
Today data scientists suffer from a fragmented workflow: typically feature engineering is performed in a relational database, while predictive models are developed with a machine learning library. Data scientists must tackle a fundamental impedance mismatch between relational databases and ML, with the attendant accidental complexity of bridging the two worlds. The workflow involves transforming between data represented in relations and tensors (multi-dimensional arrays), context-switching between the declarative SQL and imperative Python (or R) programming paradigms, and plumbing between siloed software systems. While relational databases provide advanced query optimization and permit selecting features by efficiently joining normalized tables, they are not always well suited to linear algebra and have limited support for the type of iteration required for ML algorithms. In contrast, ML libraries provide mature support for linear algebra operations, automatic differentiation and mathematical optimization algorithms, although they are not adequate for relational queries and require the data matrix in a denormalized form.
The overhead for interfacing databases and ML libraries materializes as increased costs: code increases in complexity and maintenance burden, projects need to specialize labor into data engineers on the SQL side and data scientists on the Python/R side, and the time to develop a model is slowed down by moving data through the pipeline and coordinating multiple people. The data science process (see Figure 1) is often experimental trying out different features, training sets, and ML models, thus the overhead costs compound on every iteration.
We envision a more efficient workflow where an individual data scientist can perform both data management and ML in a single software system. We present sql4ml as an integrated approach for data scientists to work entirely in SQL, while using a ML library like TensorFlow as the backend for ML tasks. We target users who are familiar with SQL and would like to have more advanced capabilities for data analysis. They need to understand the math for defining an ML model and write it in SQL, but do not need to deal with the lower-level details of ML libraries, such as data batching and representation. Data representation in relations is standardized in the SQL language.
By synergizing a database and ML library, the user has more flexibility to define the algorithm of her choice compared to prepackaged ML algorithms. Our implementation specifically targets SQL and TensorFlow (Python API), although the ideas are equally applicable to other database languages such as Datalog, other ML libraries such as PyTorch, and other data science languages such as R.
In addition to using SQL for traditional data management, the user can also define and train supervised ML models in SQL. She provides concrete SQL tables of features and target values (labels) of training observations, and specifies the ML model by SQL queries for the model prediction function and objective function in terms of an empty table of weights (parameters or coefficients) which will be learned during training. sql4ml generates the boilerplate code to move the features from relations in the database to CSV files on disk or directly to tensors in the ML library.
sql4ml automatically translates SQL queries defining an ML model to TensorFlow Python code, which is sufficient to invoke gradient descent to find the optimal weights that minimize the objective function. The objective function written in SQL can be understood as a query for the optimal weights that minimize it. The generated TensorFlow/Python code is effectively a physical plan of the query. sql4ml thus makes it easier for a single person to run the full data science cycle, because it eliminates the need for two programming paradigms and automates moving data between systems, thus leaving more resources to find the right features and model.
Finally, sql4ml moves the computed weights from the ML library back to the database. Thus, the data scientist can stay in the database to compute predictions on test data and statistics about the ML model, such as percent error or precision and recall. Furthermore, the database can be used to store multiple trained ML models and keep track of the features that were used in each ML model, in contrast to ad hoc management of weights in files on the filesystem. In this way, she can also use the database as a system for keeping records of experimental results and manage the entire ML lifecycle in one place.
In summary, we provide an end-to-end workflow that starts in the database, pushes the training of the model to the ML library, and stores the computed weights back in the database. A core contribution is a method for translating objective functions of ML algorithms written in SQL to linear algebra operations in an ML library. We demonstrate how our translation method works and apply it on popular machine learning algorithms, including Linear Regression, Factorization Machines, and Logistic Regression.
The rest of the paper is organized as follows. Section 10 presents an overview of the sql4ml workflow in terms of input and generated code. In section 11 we discuss differences in the functionality and data structures between relational databases and ML frameworks, whereas in section 12 we describe the steps of our approach in detail. Sections 13 and 14 present the usability benefits of sql4ml and an experimental evaluation respectively. Finally, section 15 discusses related work and section 16 concludes the paper.
Bridging the gap
Given the mismatch between the data and programming paradigms in databases and machine learning systems, we propose a unified approach where both can be programmed from SQL. In sql4ml, the data analyst writes SQL queries both to prepare the features and to define the ML task. In particular, the user writes the objective function of the ML model in SQL and the system automatically translates it to the corresponding TensorFlow code. Moreover, the data are transparently moved from relations in the database to tensors in the ML system.
Common components in ML algorithms
A large class of supervised algorithms in ML, including linear regression, logistic regression, Factorization Machines and SVM, follow a common pattern in how they work. Unsupervised algorithms, such as k-means, can be reformulated to fit into this pattern, too, if we replace the concept of features with data points. We explain the encountered similarities below.
Prediction and Objective Functions
Given a set of input features $`x`$, a machine learning algorithm can be specified by an objective function. For example, in logistic regression the relationship between the target variable and the input features $`x`$ is modeled by the prediction function
\begin{equation}
\label{sigmoid_equation}
h_\theta(x)=\frac{1}{1+e^{-x^\top\theta}}
\end{equation}
where $`\theta`$ are the unknown weights of the ML model to be optimized. The objective function (loss function or cost function) of logistic regression is
\begin{equation}
\label{logistic_loss}
-\frac{1}{n}\sum_{i}^{n}[y_i\log(h_\theta(x_i)) + (1-y_i)\log(1-h_\theta(x_i))]
\end{equation}
Optimization
The minimization/maximization of the objective function is done using a mathematical optimization algorithm, such as gradient descent. Gradient descent is an iterative method, which computes the derivative of the objective function and updates the ML model weights in each iteration. Given a system such as TensorFlow that supports automatic differentiation and implements a mathematical optimization algorithm, the user only needs to provide the objective function in terms of input features and weights.
End-to-end workflow
We propose a workflow where the user writes SQL both to query the data and to express the objective function of the ML model. From the SQL queries defining the ML model, a TensorFlow representation consisting of linear algebra operators and tensors is generated.
This representation is created in two steps. First, the SQL code defining the objective function of the ML model is converted to an abstract syntax tree (AST), which is then automatically translated to code in the host language of a ML framework. This code is supplemented with a few more lines calling gradient descent on the objective function inside a loop. The entire program is then executed on the ML framework. In order to feed data from relations to tensors, we automatically generate code either for connecting and querying the database using a suitable driver or for exporting data to files. After the generated code is executed and the optimal weights for the objective function are computed, they are transferred to relations inside the database. The proposed workflow is displayed in Figure 2.
Note that joins needed in feature engineering are still evaluated inside the database, whereas any linear algebra operation written in SQL is translated to the appropriate operators of the ML framework and is evaluated there. On ML frameworks that support it, this also means the linear algebra part can run on GPUs. Relational databases provide quite advanced query optimization, and there are already mature solutions for linear algebra operations, automatic differentiation and mathematical optimization algorithms outside the database ecosystem. In our approach the key components of a machine learning algorithm are provided using SQL, whereas the training of the ML model is executed on a system specializing in machine learning, hence avoiding the need for implementing an iterative process in the language of the database. Essentially, we attempt to combine the best from both worlds, but at the same time do so in a transparent way, in order to reduce the manual work required by the user and the need to be familiar with lower-level details of machine learning frameworks, such as data representation. Also, by creating an under-the-hood synergy between a database and the mathematical optimizers of ML frameworks, the user has more flexibility to define the algorithm of her choice compared to using ready-made ML algorithms.
ML algorithms in SQL
In this section we present how machine learning algorithms are written in SQL through the example of logistic regression. As stated earlier, the user writes the objective function of the ML model in SQL.
We implement logistic regression on the following schema:
observations(rowID: int)
features(rowID: int, featureName: string,
v: double)
targets(rowID: int, v: double)
weights(featureName: string, v: double)
The observations table stores the identification numbers of training
observations. For every observation and feature, the features table
has an entry with the feature’s value. For example, the entry
(1, "feature1", 30.5) means that feature "feature1" of observation
1 has value 30.5. The targets table stores the observed target
values of the training observations. Data for these three tables are
provided by the user, extensional to the database. Finally, the
weights table is filled in by learning the weights of a model in
TensorFlow.
Given these tables, we define SQL views sigmoid for the sigmoid
function of the logistic regression model and objective for the
objective function (see Listing
[logistic_loss_sql]). These
functions embody the Equations
[sigmoid_equation] and
[logistic_loss] and are expressible
in SQL using numeric and aggregation operations. Instead of evaluating
them inside a database, they are translated to TensorFlow linear algebra
functions for evaluation inside TensorFlow. Note that the user defines
the model by writing prediction and objective functions in SQL, while
sql4ml generates the code to train the ML model.
CREATE VIEW product AS
SELECT SUM(features.v * weights.v) AS v,
features.rowID AS rowID
FROM features, weights
WHERE features.featureName=weights.featureName
GROUP BY rowID;
CREATE VIEW sigmoid AS
SELECT product.rowID AS rowID,
1/(1+EXP(-product.v)) AS v
FROM product;
CREATE VIEW log_sigmoid AS
SELECT sigmoid.rowID AS rowID,
LN(sigmoid.v) AS v
FROM sigmoid;
CREATE VIEW log_1_minus_sigmoid AS
SELECT sigmoid.rowID AS rowID,
LN(1-sigmoid.v) AS v
FROM sigmoid;
CREATE VIEW objective AS
SELECT (-1)*SUM((targets.v * log_sigmoid.v) + ((1-targets.v) * log_1_minus_sigmoid.v)) AS v
FROM targets, log_sigmoid, log_1_minus_sigmoid
WHERE targets.rowID=log_sigmoid.rowID AND log_sigmoid.rowID=log_1_minus_sigmoid.rowID;
The SQL code as is cannot be evaluated because the weights table is
empty. Even if the weights were initialized with random values, only a
single round of evaluation would be possible. No iterative process is
defined in the code of Listing
[logistic_loss_sql] to update
tuples in weights. Using the SQL representation we produce a
representation in TensorFlow/Python API to be evaluated in TensorFlow.
From SQL to TensorFlow API
In this section we describe the process of translating machine learning algorithms from SQL to TensorFlow API. We start with a simple example, the prediction function of linear regression.
\begin{equation}
\label{lr_prediction}
y_i = x_i^\top \theta
\end{equation}
This may be implemented in SQL as follows:
CREATE VIEW predictions AS
SELECT features.rowID AS rowID,
SUM(features.v * weights.v) AS v
FROM features, weights
WHERE features.featureName = weights.featureName
GROUP BY rowID;
Using TensorFlow API we could write this as follows:
predictions = tf.tensordot(features, weights, axes=1)
In the mathematical equation and TensorFlow, dimensions between $`x_i`$ and $`\theta`$ need to match, in order for the matrix-vector multiplication to be executed. Hence the transpose operation on $`x_i`$ in Equation [lr_prediction]. In SQL this part is implemented as an element-wise multiplication between columns and a group by aggregation, which implies that dimensions are irrelevant in this case. The translation process generates the TensorFlow snippet above from the corresponding SQL snippet.
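To see concretely why the GROUP BY aggregation and tf.tensordot agree, the short NumPy check below (with made-up values, not from the paper) computes SUM(features.v * weights.v) per rowID over a (rowID, featureName, value) table and compares it with the dense matrix-vector product.

```python
import numpy as np

# (rowID, featureName, value) tuples, mimicking the features relation.
features_rel = [(1, "f1", 3.5), (1, "f2", 20.0),
                (2, "f1", 1.0), (2, "f2", 4.0)]
weights_rel = {"f1": 0.5, "f2": 0.1}  # featureName -> weight value

# SQL view: SELECT SUM(features.v * weights.v) ... GROUP BY rowID
per_row = {}
for row_id, name, value in features_rel:
    per_row[row_id] = per_row.get(row_id, 0.0) + value * weights_rel[name]

# The same computation as a dense matrix-vector product (one column per feature).
X = np.array([[3.5, 20.0], [1.0, 4.0]])
theta = np.array([0.5, 0.1])

print(per_row)        # {1: 3.75, 2: 0.9}
print(X.dot(theta))   # [3.75 0.9 ]
```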
A SQL program is a list of SQL queries. We handle a specific (albeit
general enough to express supervised ML algorithms) type of select
query that is described in algorithm
[translateView]. We assume that there
is only one numeric expression per query, although it may involve
multiple nested computations, e.g. (a+b)/c. Because we do not handle
subqueries, we store the result of a query as a view and use the view
name in subsequent queries when needed. Algorithm
[translateView] displays the
high-level steps to translate a create view query to a TensorFlow
command. The translation proceeds by applying typical steps from
compiler theory. A SQL query is first tokenized (lexing) and parsed.
Then an AST is produced for it. Based on the AST we extract the numeric
expression involved in the query as well as the name of the view, and
generate an equivalent TensorFlow command.
The AST represents necessary information for translating a SQL query to
functions of TensorFlow API. This information includes the operators
that are involved in a SELECT numeric expression, the columns on which
the operators operate, the tables where the columns came from, as well
as columns involved in group by expressions. In order to generate the
TensorFlow code, we need to match SQL numeric operators and aggregation
functions with linear algebra operators. In Table
1 we present equivalences
between these two categories of operators encountered in examples in
this paper.
| Linear algebra operator | SQL | TensorFlow |
|---|---|---|
| matrix addition | + | tf.add |
| matrix subtraction | - | tf.subtract |
| Hadamard product | $`\ast`$ | tf.multiply |
| Hadamard division | / | tf.div |
| matrix product | SUM(_ * _) | tf.tensordot |
Equivalences between linear algebra, SQL and TensorFlow operators
As soon as we extract the numeric expression among the SELECT
expressions of a query, we analyze it recursively by visiting each
subexpression and decomposing it into the operators and the columns on which they operate, as described in Algorithm
[translateNumericExpr]. We
apply a compositional translation where there is a mapping that
preserves the structure between the SQL and TensorFlow expression. If
the expression is a column or a constant, we output a TensorFlow
constant or variable name. Otherwise, we match it to a TensorFlow
operation according to Table
1 and proceed with
translating the operands of the expression. Recall that tensors store
only real values, so analyzing column projections/expressions with other
types is not applicable.
(Figure 3: the features-weights product.)
Hence, in the select query of view predictions we will translate only
the expression SUM(features.v * weights.v) to TensorFlow code. In
Figure 3 we present the TensorFlow AST
that corresponds to the SQL AST of a matrix-vector multiplication. Based
on Table 1 we match the combination
of function SUM and $`\ast`$ with tf.tensordot. We then scan the
columns features.v and weights.v, and identify that they belong to
tables features and weights. We use these table names for the
tensors participating in the corresponding TensorFlow operation.
Finally, the argument axes=1 in tf.tensordot is used to define the
axes over which the sum of products takes place. Since our work focuses
on data that can be stored in up to two-dimensional tensors, i.e.
matrices and vectors, we will always use 1 as the value of the axes
argument, which is equivalent to matrix multiplication.
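The paper's translator is written in Haskell over Queryparser ASTs; the Python sketch below is only a hedged illustration of the compositional scheme just described, using a toy AST shape and the operator correspondences of Tables 1 and 2 to emit TensorFlow call strings.

```python
# Toy AST nodes: ("const", value), ("col", table, column), or ("op", operator, operand...).
# This is an illustration only; the real sql4ml implementation walks Queryparser's
# Haskell ASTs rather than tuples like these.
OP_TO_TF = {
    "+": "tf.add", "-": "tf.subtract", "*": "tf.multiply", "/": "tf.div",
    "neg": "tf.negative", "EXP": "tf.exp", "LN": "tf.log", "SUM": "tf.reduce_sum",
}

def translate(expr):
    kind = expr[0]
    if kind == "const":
        return str(expr[1])
    if kind == "col":
        table, _column = expr[1], expr[2]
        return table  # the tensor is named after the table the column comes from
    op, operands = expr[1], expr[2:]
    if op == "SUM*":  # SUM(_ * _) pattern -> matrix product, as in Table 1
        return "tf.tensordot({}, axes=1)".format(", ".join(translate(o) for o in operands))
    return "{}({})".format(OP_TO_TF[op], ", ".join(translate(o) for o in operands))

# sigmoid view: 1 / (1 + EXP(-product.v))
sigmoid_ast = ("op", "/", ("const", 1),
               ("op", "+", ("const", 1),
                ("op", "EXP", ("op", "neg", ("col", "product", "v")))))
print("sigmoid =", translate(sigmoid_ast))
# -> sigmoid = tf.div(1, tf.add(1, tf.exp(tf.negative(product))))
```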
Apart from linear algebra, other mathematical operations may be encountered in machine learning algorithms. An example of this is the sigmoid function used in logistic regression, which is defined in Equation [sigmoid_equation]. The sigmoid function is implemented in SQL as follows.
CREATE VIEW sigmoid AS
SELECT product.rowID AS rowID,
(1/(1+EXP(-product.v))) AS v
FROM product;
In the TensorFlow API we can write this using the supported sigmoid function directly
sigmoid = tf.sigmoid(product)
or in a more verbose way as follows:
sigmoid = tf.div(1,
          tf.add(1, tf.exp(tf.negative(product))))
The translation process is very similar to the former example, except that here we have the exponential function operating on the product between the features of each observation and the weights $`\theta`$. As a result, apart from the mapping between element-wise numeric and linear algebra operators, we also need to match mathematical functions used in SQL to functions of the TensorFlow API. All mathematical functions we have encountered so far in our SQL examples are also included in the TensorFlow API. We present the correspondence between the two in Table 2.
| Math function | SQL | TensorFlow |
|---|---|---|
| sum | SUM | tf.reduce_sum |
| count | COUNT | tf.size |
| average | AVG | tf.reduce_mean |
| exponential | EXP | tf.exp |
| natural logarithm | LN | tf.log |
| negative number | - | tf.negative |
| square | POW(_, 2) | tf.square |
Matching mathematical functions in SQL and TensorFlow API
Assuming that the product between features and weights has already been defined in a different query, and based on the SQL AST in Figure
4, we identify four operators,
/, +, EXP and - in the numeric select expression of view sigmoid.
Each of them will be matched to a TensorFlow operator forming the AST at
the right of Figure
4. More specifically - is
matched with tf.negative. We treat it as a different case from
tf.subtract, when it is applied to a single constant / column. EXP is
matched with tf.exp, whereas / and + are matched with tf.div and
tf.add respectively. We combine all operators to a single TensorFlow
command by nesting them. The generated TensorFlow command will be
sigmoid = tf.div(1, tf.add(1,
tf.exp(tf.negative(product)))).
Finally, when we have a group by clause combined with an aggregation
function in SQL, such as SUM or AVG, we need to analyze it, in order to
extract necessary information regarding the parameters of the matched
reduce operation in TensorFlow. In reduce operations in TensorFlow, the
user also has to specify the dimensions to reduce over, i.e. the axis
parameter. If the group by in the SQL statement is on a primary key,
then we need to reduce across columns and the value of axis parameter
is 1. If the group by column is not a key, then we need to reduce
across rows and the axis parameter is 0. According to TensorFlow API,
the axis parameter takes values in the range [-rank(input_tensor),
rank(input_tensor)], but as we limit our translation method to matrices
and vectors, we do not deal with higher dimensional tensors of rank
greater than 2. If there is no group by clause, we give the value None to the axis parameter, and the reduction happens across all dimensions, as for example in the objective query of
logistic regression in Listing
[logistic_loss_sql].
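The axis correspondence can be checked directly with a small TensorFlow snippet; assuming features sit in an (observations × features) matrix, the made-up values below show the three cases: GROUP BY on the row key, GROUP BY on a non-key column, and no GROUP BY at all.

```python
import tensorflow as tf

# 2 observations (rows) x 3 features (columns); values are made up.
m = tf.constant([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

per_row = tf.reduce_sum(m, axis=1)     # GROUP BY on the primary key -> one value per observation
per_col = tf.reduce_sum(m, axis=0)     # GROUP BY on a non-key column -> one value per feature
overall = tf.reduce_sum(m, axis=None)  # no GROUP BY -> a single scalar

with tf.Session() as session:
    print(session.run(per_row))   # [ 6. 15.]
    print(session.run(per_col))   # [5. 7. 9.]
    print(session.run(overall))   # 21.0
```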
To provide a translation, apart from the SQL code, we also request a few hints from the user. These hints can be provided in the form of command line arguments or in a configuration file. More specifically, the names and dimensions of the following tables are needed:
- Tables that store features, e.g. features,
- Columns in the aforementioned tables that store the names of the features, e.g. featureName in table features,
- Tables that store weights of the ML model, which are initially empty and will be populated once the gradient descent algorithm has computed the optimal values for them, e.g. weights,
- The dimensions of the tables that store weights of the ML model, which can be derived from the cardinalities of all columns in the table except the one storing the value of the weight (whose type will be double), e.g. [2]. The number of weights should also be equal to the number of features.
- The table that stores target values of training observations, e.g. targets.
Based on these tables, we are able to determine which tensors will be defined as constants or variables on the TensorFlow side, as well as provide their dimensions. Finally, the following information is necessary for the training loop and the connection to the database (a hypothetical example of these hints is sketched after the list below).
- Hyperparameters for the gradient descent algorithm: the number of iterations to run (e.g. 1000) and the learning step (e.g. 0.00001),
- A URL and the credentials of a user to connect to the database, e.g. mluser@localhost/items.
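The paper leaves the exact command-line/configuration format open, so the snippet below is only a hypothetical illustration of the hints listed above, using the example values from the text; the field names are invented for readability.

```python
# Hypothetical hint/configuration structure for sql4ml; field names are illustrative.
hints = {
    "feature_tables": ["features"],
    "feature_name_columns": {"features": "featureName"},
    "weight_tables": ["weights"],
    "weight_dimensions": {"weights": [2]},   # must equal the number of features
    "target_table": "targets",
    "iterations": 1000,
    "learning_rate": 0.00001,
    "db_url": "mluser@localhost/items",
}
```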
Following the translation process, the TensorFlow code that is being generated given the SQL snippet of Listing [logistic_loss_sql] is displayed in Listing [logistic_tf].
product=tf.tensordot(features,weights,axes=1)
sigmoid=tf.div(tf.to_double(1),tf.add(tf.to_double(1),tf.exp(tf.negative(product))))
log_sigmoid=tf.log(sigmoid)
log_1_minus=tf.log(tf.subtract(tf.to_double(1),sigmoid))
objective=tf.negative(tf.reduce_sum(tf.add(tf.multiply(targets,log_sigmoid),tf.multiply(tf.subtract(tf.to_double(1),targets),log_1_minus)),None))
optimizer = tf.train.GradientDescentOptimizer(3.0e-7)
train = optimizer.minimize(objective)
with tf.Session() as session:
session.run(tf.global_variables_initializer())
for step in range(1000):
session.run(train)
print("objective:", session.run(objective))Moving data from relations to tensors
So far we described how we translate SQL numeric and aggregation operations into TensorFlow operators. In order to execute the TensorFlow program and compute the optimal weights of the ML model, we need to feed the tensors with data. Data are initially stored as relations inside the database. So we need a mechanism to transfer the data from relations to tensors. However, in section 11 we described the mismatch between the two data structures.
Let us start with possible ways of storing training data in relations.
Suppose each observation has a number of features and a target value. We
could store this information in three relations: observations,
features and targets. The schema of these relations would be
observations(rowID: int)
features(rowID: int, name: string, v: double)
targets(rowID: int, v: double)
Denormalization
Given this schema we need to transfer the feature values of each
observation to a matrix. Although relation features stores each
feature of an observation in a different tuple, e.g., (1, “price”, 3.5)
and (1, “size”, 20), for a matrix we need to consolidate all features of
an observation to a single row, e.g. (3.5, 20). In addition to this, we
are interested only in the real values, discarding the observation IDs
and names of features. Hence, we automatically construct a SQL query,
which transforms the schema of features relation to a different one
storing all features of an observation in a single tuple as it is
displayed in Figure
5. The resulting schema
is the following:
features(rowID: int, f1: double, f2: double)
We create a column per feature name and store only the value of the
feature in it. The pivot query in Listing
[pivot_query], which transforms rows to
columns for features relation, is generated automatically and is added
above the TensorFlow code defining the ML model, so that we can feed
tensors with the appropriate data before training begins. Essentially, a
relation $`R(id_i, fn_i, fv_i)`$ can be formulated as a
$`(1..cardinality(id))\times(1..cardinality(fn))`$ matrix where
\begin{multline}
M[i][j] = v, \\
i \in 1..cardinality(id), j \in 1..cardinality(fn), v \in \mathbb{R}
\end{multline}
Of course, in order to be able to execute the pivot query, we also add a
few lines of boilerplate code to connect to the database and run the
query there. The result of the query is then processed with a little bit
more of boilerplate code, which discards rowID and keeps only feature
values, in order to be fed to a tensor.
SELECT rowID,
SUM(CASE WHEN name='f1' THEN v ELSE 0.0 END) AS f1,
SUM(CASE WHEN name='f2' THEN v ELSE 0.0 END) AS f2
FROM features
GROUP BY rowID;
Target values are fed to a vector in a similar manner. Because this
relation stores a single tuple for each rowID, we do not need a pivot
query. We rather execute a select query on table targets and keep
only the target value in order to store it in a vector.
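A hedged sketch of the kind of boilerplate described above: it connects to the database, runs the pivot query, drops rowID and builds a feature matrix and a target vector. The paper only requires a suitable driver; the choice of psycopg2 and the exact code shape here are assumptions.

```python
import numpy as np
import psycopg2  # driver choice is an assumption; the paper only requires a suitable driver

PIVOT_QUERY = """
    SELECT rowID,
           SUM(CASE WHEN name='f1' THEN v ELSE 0.0 END) AS f1,
           SUM(CASE WHEN name='f2' THEN v ELSE 0.0 END) AS f2
    FROM features
    GROUP BY rowID
    ORDER BY rowID;
"""

conn = psycopg2.connect("dbname=items user=mluser host=localhost")
cur = conn.cursor()

# Run the generated pivot query, then discard rowID and keep only the feature values.
cur.execute(PIVOT_QUERY)
feature_matrix = np.array([row[1:] for row in cur.fetchall()], dtype=np.float64)

# Targets already have one tuple per rowID, so a plain select suffices.
cur.execute("SELECT v FROM targets ORDER BY rowID;")
target_vector = np.array([row[0] for row in cur.fetchall()], dtype=np.float64)

conn.close()
```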
However, instead of storing all features in a single table, we could have a case where features are stored in different tables. Thus, we would need to join tables to gather the features of an observation. Suppose that we have a database storing items and their sales and we want to use it to train a ML model for future sales prediction. The schema of the database is
-- dimensions
item(itemID: string)
stores(storeID: string)
dates(dateValue: string)
-- observations
observations(itemID: string, storeID: string, dateValue: string)
-- features
familyFeat(itemID: string, family: string, v: int)
cityFeat(storeID: string, city: string, v: int)
-- targets
sales(itemID: string, storeID: string, dateValue: string, v: double)
In the schema above an observation depends on three dimensions: items,
stores and dates. The same holds for sales. We can use these data
to predict future sales of an item at a particular store and date. The
features of the ML model are based on item or store characteristics and
are stored in familyFeat and cityFeat. Because families of items and
cities are strings, the user has created a one-hot encoding
representation in familyFeat and cityFeat relations. Using this
representation one can convert categorical variables into a numeric
form, where each family and city value becomes a real feature whose
value can be either 0 or 1. Hence, in order to gather the feature values
for an observation, we need to join observations with familyFeat and
cityFeat. Listing
[pivot_query_family_city]
displays the automatically generated query that joins observations
with familyFeat and cityFeat and gathers all features of an
observation in a single row. Figure
6 presents the mapping
between features from different tables and columns of the matrix storing
features for all observations. It displays the export of a table to a
dense matrix, but in case of categorical features with lots of zero
values exporting to a sparse representation, such as csr , to save on
space and time is also possible.
SELECT
observations.itemID, observations.storeID, observations.dateValue,
groceryValue, cleaningValue, quitoValue, guayaquilValue
FROM observations,
(SELECT itemID,
SUM(CASE WHEN family='grocery' THEN v ELSE 0 END) AS groceryValue,
SUM(CASE WHEN family='cleaning' THEN v ELSE 0 END) AS cleaningValue
FROM familyFeat GROUP BY itemID)
AS familyFeat_temp,
(SELECT storeID,
SUM(CASE WHEN city='Quito' THEN v ELSE 0 END) AS quitoValue,
SUM(CASE WHEN city='Guayaquil' THEN v ELSE 0 END) AS guayaquilValue
FROM cityFeat GROUP BY storeID)
AS cityFeat_temp
WHERE observations.itemID=familyFeat_temp.itemID
AND observations.storeID=cityFeat_temp.storeID;
Tables corresponding to weights of the ML model are also mapped to a single vector. By mapping indices of the weights vector to the initial tables and columns, we are able to store computed weights back to the database after training. Finally, all TensorFlow commands involving feature and weight tensors as generated by their initial representation using normalized tables are rewritten in terms of the global features and weights tables.
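A minimal sketch of the write-back step, under the same driver assumption as above: the trained weight vector is inserted into the weights(featureName, v) table, using whatever feature-name ordering was fixed when the weights tensor was built.

```python
import psycopg2  # driver choice is an assumption, as before

def store_weights(weight_values, feature_names,
                  dsn="dbname=items user=mluser host=localhost"):
    """Write the trained weight vector back into the weights(featureName, v) table.
    feature_names must follow the same order used to build the weights tensor,
    so that vector indices map back to the right features."""
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()
    cur.execute("DELETE FROM weights;")  # the table started out empty
    cur.executemany("INSERT INTO weights (featureName, v) VALUES (%s, %s);",
                    [(name, float(w)) for name, w in zip(feature_names, weight_values)])
    conn.commit()
    conn.close()
```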
Reuse of feature computations
By normalizing features across tables and connecting the observations with them using foreign keys, we avoid the redundancy of storing shared feature values multiple times. When gathering features to populate a global feature matrix as in the previous subsection, the repetitive nature of features can be exploited to reuse computations. Reuse of computations can be carried out via subqueries or precomputed tables/materialized views, both of which are standard or widely supported techniques in relational databases.
In the example schema of listing [items_schema], observations are based on three dimensions: items, stores and dates. The number of observations is less than or equal to $`items \times stores \times dates`$. Features that are based solely on items, stores or dates are shared across observations that concern the same item, store or date. For example, if we have two observations, (“item1”, “store1”, “date1”) and (“item1”, “store1”, “date2”), only the features involving the date change between the two observations. Item and store features remain the same. Product family and city may be an item and store feature respectively. Hence, instead of recomputing family and city features for each observation, we can either compute them once using a subquery or precompute them and store them in a materialized view/table.
Continuing on the example above, a query for one-hot-encoding the categorical features of family and city, and gathering all features together could be the following:
SELECT observations.itemID,observations.storeID,observations.dateValue,
SUM(CASE WHEN family='grocery' THEN 1.0 ELSE 0 END) AS groceryValue,
SUM(CASE WHEN family='cleaning' THEN 1.0 ELSE 0 END) AS cleaningValue,
SUM(CASE WHEN city='Quito' THEN 1.0 ELSE 0 END) AS quitoValue,
SUM(CASE WHEN city='Guayaquil' THEN 1.0 ELSE 0 END) AS guayaquilValue
FROM observations, items, stores
WHERE observations.itemID=items.itemID AND observations.storeID=stores.storeID
GROUP BY observations.itemID, observations.storeID, observations.dateValue;
This query naively computes the one-hot encoding representation of
family and city features using case when expressions for every
observation. sql4ml generates a more efficient version of the query above, as
is depicted in Listing
[pivot_query_family_city],
making use of subqueries that denormalize the categorical features of
family and city for every item and store respectively. Then each
observation is joined on itemID and storeID with corresponding
tuples from familyFeat and cityFeat tables. The result of the
subqueries can also be materialized.
Due to the statistics the database holds, it is easy to figure out that the cardinalities of items, stores and dates are smaller than the total number of observations and thus that some feature computations will be repeated. As a result, features that are based on individual dimensions of observations can be automatically precomputed by the system. We report evaluation results comparing time to export features with and without precomputation in section 14.
Implementation
The code for translation and transferring data between data structures
is implemented in Haskell. For lexing, parsing and generating the
AST of SQL code we used the open source project Queryparser, also
implemented in Haskell. The Queryparser project supports three SQL dialects
(Vertica, Hive and Presto), but we found it adequate for parsing at
least select queries, create view and create table commands in
MySQL and PostgreSQL, as well. Following the generation of AST, we
developed the rest of the code for analyzing it and generating the
TensorFlow program end to end.
A translation is essentially a function from a source type to a target type. As such, functional languages are a good fit for developing translation processes. For translating SQL to TensorFlow, SQL is captured in algebraic data types (ADTs). The AST of a query is represented using the ADTs provided by the Queryparser project. We defined extra data types to represent objects of interest particularly to the TensorFlow translation.
The assembly of a TensorFlow command, i.e. the output of
translateNumericExpr and getViewName in Algorithm
[translateView], is carried out
through functions walking over the AST and performing pattern matching
on it. Various other functions are implemented in a similar way to
extract group by keys, key constraints, column expressions, create table
statements and generally any information needed to generate the
end-to-end TensorFlow program from the AST. We found that features like
pattern matching and tail recursion encountered in functional languages,
such as Haskell, make the manipulation of tree structures easier than in imperative languages.
Overview of sql4ml
sql4ml takes as input SQL code and generates Python code consisting either of TensorFlow API functions or other Python modules, which automates stages of the workflow in Figure 1. An overview of the code files involved in sql4ml is presented in Figure 7 and highlighted next.
User’s code. The user provides SQL code defining which features shall be used for training and the objective function of the ML model (left side of Figure 7).
Generated code. There are four main groups of code that sql4ml generates, which correspond to the four dotted rectangles of Figure 7. The first concerns the transfer of data from relations to tensors. sql4ml generates SQL queries for exporting feature and target values of training data and Python code for loading these values into tensors. In the second group, there is code for the definition of the ML model. This is generated by translating the SQL queries that define the objective function of the ML model to TensorFlow API functions. Finally, boilerplate code is generated for triggering gradient descent and importing the computed values of weights back into database tables after training. The execution order of the generated code goes from top to bottom, following the arrow in Figure 7.
DB and ML icons in Figure
7 are used to indicate which parts
of the code run on the database and which on the ML framework. This is
also denoted by the file extensions .sql and .py on each filename
(in sql4ml we chose to translate SQL to Python code, including functions of the
TensorFlow API, and use TensorFlow as an execution engine.) We describe
in detail how each part of the code is generated and provide specific
examples of user provided and automatically generated code in section
12.
Conclusion and Future Work
We presented sql4ml, an integrated approach for data scientists to work entirely in SQL, while using a ML framework as an execution engine for ML computations. The approach of sql4ml automates and facilitates tedious parts of the current fragmented data science workflow. We developed a prototype and showcased our translation method from SQL to TensorFlow API code on well-known ML models. Evaluation results demonstrate that this translation is completed in little time.
As future work we would like to investigate further the use of sql4ml in defining and training neural networks, such as Recurrent Neural Networks (RNNs), as well as the adaptation of the system to unsupervised ML models. Another very interesting line of work is the study of query/code optimization techniques whose application could result in more efficient code on the ML framework side. Recent work on in-database machine learning discusses the use of functional dependencies in reducing the dimensionality of ML models. Such techniques could prove useful in our approach as well, as they could produce lower-dimensional ML models and accelerate training, even if this is executed outside the database.
Usability Benefits
Assuming the data science workflow displayed in Figure 1, we argue that the use of sql4ml has a number of benefits from a usability perspective.
Easier maintenance
We propose the use of SQL for both feature engineering and the definition of a ML model. It is a common scenario to have a data engineer exporting the appropriate data from a database and a data scientist writing code for a ML model. The first part involves data processing pipelines built out of relational operators common in SQL, Datalog and data processing APIs, whereas the second is mainly based on linear algebra operators and probabilities, and is expressed in ML framework APIs written in imperative languages. Expressing the entire workflow in a single language increases ease of maintenance, as users are no longer required to be familiar with more than one programming paradigm. Moreover, declarative database languages, such as SQL, are based on high-level, standardized abstractions, which hide their physical implementation from the user.
Facilitating data handling
sql4ml automates the move of training data from relations to tensors and removes the burden of data exports from the user. In addition, processing data with SQL also facilitates a few frequent operations in machine learning. When working with large datasets, training usually happens on batches. To create randomized batches as a means of avoiding implicit bias in a ML model, training data are shuffled. Using a library, such as Python Pandas, on the ML framework side for shuffling requires loading the entire training set in memory, which results in out-of-memory errors when the data are larger than the machine's memory. This situation can be avoided if we use a RDBMS for arranging the tuples of a table in random order. Also, the creation of the batches themselves might fail due to memory issues. Lately, TensorFlow and PyTorch have added data structures for handling datasets incrementally, but if this is not provided by the library, then a typical procedure is to load data in memory and take random samples to create each batch. In this case, executing SQL queries to fetch tuples per batch can provide a solution to memory issues.
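As an illustration of fetching shuffled mini-batches straight from the database, here is a hedged sketch; the shuffled temporary table, the hypothetical features_pivoted table and the LIMIT/OFFSET paging are one simple way to do it, not necessarily what sql4ml generates.

```python
import psycopg2  # driver choice is an assumption

def batches(dsn, batch_size=200):
    """Yield mini-batches of training rows in randomized order, letting the
    database do the shuffling instead of loading everything into memory."""
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()
    # Materialize one shuffled copy so that LIMIT/OFFSET paging stays stable.
    # features_pivoted is a hypothetical table holding one row per observation.
    cur.execute("""CREATE TEMP TABLE shuffled AS
                   SELECT f.rowID, f.f1, f.f2, t.v AS target
                   FROM features_pivoted f JOIN targets t ON f.rowID = t.rowID
                   ORDER BY random();""")
    cur.execute("SELECT count(*) FROM shuffled;")
    total = cur.fetchone()[0]
    for offset in range(0, total, batch_size):
        cur.execute("SELECT * FROM shuffled LIMIT %s OFFSET %s;",
                    (batch_size, offset))
        yield cur.fetchall()
    conn.close()
```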
Model management
Because in sql4ml the weights of a model are stored in a table, the user is able to query them. For example, she might be interested in examining the range of weights in order to identify large values, using a SQL query similar to the following.
SELECT featureName, v FROM weights WHERE v > $(threshold);
-- or
SELECT featureName, v FROM weights ORDER BY v;
In ML frameworks models are saved to files whose extensions are not standardized. For example, TensorFlow uses the .ckpt and .data extensions, whereas PyTorch uses .pt or .pth files. Each API provides its own functions to access parts of these files, though each has its own particularities with which the user needs to get familiar. We argue that declarative database languages provide a standard and higher-level way to query data and as such can be used to manage ML models, too. We could also store metadata regarding features and weights in tables. One such example is to add columns or tables keeping track of different versions of a model built from different feature subsets.
Interoperability
sql4ml generates valid and readable TensorFlow/Python code. It can be combined with other code in Python, which is a popular language among the data science community. In addition, TensorFlow code can run on GPUs, providing additional hardware options for training outside the database.
Related work
The line of work in sql4ml is mainly related to the following three directions.
In-database machine learning There have already been efforts to bring machine learning inside the database. MADlib is a library of in-database methods for machine learning and data analysis. It provides SQL-based ML algorithms, which run on PostgreSQL or the Greenplum database. SQL operators are combined with user-defined functions (UDFs) in Python and C++ implementing ML algorithms for popular data analysis tasks, such as classification, clustering and regression. Bismarck describes a unified architecture for integrating analytics techniques as user-defined aggregates (UDAs) into an RDBMS. Similarly, BigQuery, which is based on the Dremel technology, supports a set of ML models that can be called in SQL queries. In systems of this flavor, the user can select from a predefined collection of available algorithms. Moreover, code inside UDFs/UDAs remains a black box and is not optimized by the database system. In our approach we let the user define the objective function of the algorithm, in order to provide more flexibility in the ML models she can develop.
A second direction regarding in-database learning aims at mingling the dataset construction with the learning phase and accelerating the latter by exploiting the relational structure of the data. Systems such as F and AC/DC operate on normalized data and rewrite objective functions of ML algorithms by pushing the involved aggregations through joins. This idea of decomposing the computations of ML algorithms comes under the umbrella of factorized ML. The purpose of these works is to showcase that, given the appropriate optimizations, a RDBMS can efficiently perform ML computations and the need for denormalization of the data can be avoided. Initial efforts in this area required manual rewriting of each ML algorithm to its factorized version and because of that suffered from low generalizability. More recent work aims to provide a more systematic approach to automatically rewriting ML computations to their factorized version. Aspects of this line of work are orthogonal to sql4ml. For example, since our workflow also starts in the database, the optimization with functional dependencies presented in that work can also be exploited by our translation method in order to reduce the weights of a model. Another example is the use of lazy joins for more efficient denormalization of data, as described in Morpheus. We argue though that by interoperating with existing ML frameworks, we gain in usability and reduce development overhead, as we can exploit helpful features, such as automatic differentiation and mathematical optimization algorithms.
Finally, other works propose an extension of SQL to support matrices/vectors and a set of linear algebra operators, whereas others present optimizations for executing recursion and large query plans on an RDBMS, which can make it suitable for distributed machine learning. In sql4ml we do not require any changes to the RDBMS or the machine learning framework. We rather propose a translation method that takes standard SQL code as input and generates valid TensorFlow/Python code.
Machine learning frameworks SystemML, TensorFlow, PyTorch, Mahout Samsara and BUDS provide domain specific languages (DSLs) and APIs that support linear algebra operations and data structures, probability distributions and/or deep learning functions, as well as useful ML-centric features, such as automatic differentiation. These systems are more than ML libraries, as they apply algebraic rewrites and operator fusion to optimize users' code. They employ an analogy of logical and physical plans similar to query optimization in database systems. The user essentially constructs a graph defining dependencies between operators and operands whose execution order is chosen by the system. Because of the emphasis on linear algebra, ML frameworks operate on denormalized data. sql4ml targets SQL users who work with relational data and would like to perform more advanced analytics using machine learning.
Mathematical optimization on relational data To the best of our knowledge, MLog and SolverBlox, of the LogicBlox database, are the closest systems to sql4ml. Both systems model ML algorithms as mathematical optimization problems, using queries that find the optimal values that minimize/maximize an objective function defined either in Datalog or in a tensor-based declarative language similar to SQL. MLog translates the user's code first to Datalog and then to TensorFlow, which finally computes the optimal weights for the objective function. Similarly, SolverBlox, currently supporting only linear programming, translates Datalog programs to an appropriate format consumed by a linear programming solver. In sql4ml we do not propose a new language for modeling ML algorithms, but rather use SQL, an already popular declarative language. Regarding the supported ML algorithms, we expand outside linear programming, as MLog does, but describe in detail a translation method directly from SQL to the TensorFlow API without using any other languages as intermediate representations.
Experiments
We conducted two types of experiments: (1) measuring the time the translation takes to generate the entire TensorFlow program for three models defined in SQL code and various feature sets, and (2) measuring the time difference between exporting training observations with and without precomputed features.
Experimental setup Experiments ran on a machine with Intel Core i7-7700 3.6 GHz, 8 cores, 16GB RAM and Ubuntu 16.04. TensorFlow code ran on CPUs, unless stated otherwise. We use PostgreSQL as an RDBMS and whenever TensorFlow API is mentioned, it refers to the Python API.
Datasets Two datasets are used throughout the experiments: Boston Housing and Favorita. Boston Housing is a small dataset including characteristics, such as per capita crime rate and pupil-teacher ratio, for suburbs in the Boston area, along with the median value of owner-occupied homes. The Favorita dataset is a real public dataset consisting of millions of daily sales of products from different stores of the Favorita grocery chain. Dataset characteristics are displayed in Table [tab:datasets].
| Dataset | Observations | Total Features | Numeric features | Categorical features |
|---|---|---|---|---|
| Boston Housing | 506 | 13 | 13 | 0 |
| Favorita | 125497040 | 12 | 4 | 8 |
Translation time from SQL to TensorFlow
In Tables [tab:translation_time] and
[tab:translation_time_normalized]
we measure the time to translate the original SQL code provided by the
user to TensorFlow code. For every translation we report average
wall-clock time after five runs. The generated TensorFlow code includes
all steps of the workflow, end-to-end, i.e. both code for
exporting/importing data and defining the ML model. In Table
[tab:translation_time] we
provide translation time from SQL implementations of three models,
linear regression, Factorization Machines and logistic regression having
all features in a single table. We also report lines of SQL code, as
well as lines of generated and handwritten TensorFlow code. SQL code
includes both the select queries for defining the ML model and the
commands for creating the tables used in its definition. We assume that
queries are formatted reasonably in multiple lines, with select
expressions, from, where and group by clauses put in separate
lines. Features for training these models are based on the Boston
Housing dataset.
We also report translation time for a linear regression model operating on different sets of normalized features, i.e. features stored in multiple tables, in Table [tab:translation_time_normalized]. Essentially, this latter experiment refers to the second case of denormalization described in section 12.5, where more rewriting steps are executed by sql4ml. We focus on how translation time is affected by increasing the number of involved features. In Table [tab:translation_time_normalized] the sets of features are based on the Favorita dataset. Product and store characteristics, such as the family of a product and the location of a store, can be used as categorical features for training a model. In Table [tab:translation_time_normalized] we assume that categorical features are converted to a one-hot encoding representation, where each value of a categorical feature becomes a numeric feature.
| ML model | Lines of SQL code | Translation time (sec) | Lines of TF generated code | Lines of TF handwritten code |
|---|---|---|---|---|
| Linear Regression | 30 | 0.0274 | 26 | 24 |
| Factorization Machines | 68 | 0.065 | 36 | 34 |
| Logistic Regression | 42 | 0.037 | 29 | 29 |
| Feature tables | Total categorical features | Numeric 1-hot encoded features | Translation time (sec) |
|---|---|---|---|
| family, city | 2 | 55 | 0.0388 |
| family, city, state, store type | 4 | 76 | 0.051 |
| family, city, state, store type, holiday type, locale, locale_name | 7 | 109 | 0.068 |
| family, city, state, store type, holiday type, locale, locale_name, item class | 8 | 446 | 0.0906 |
Takeaways Experiments show that translation takes a few dozen milliseconds. Note that translation time is not affected by the number of training observations, only by the lines of SQL code and the number of features, which affect the generation of the denormalization queries in Listings [pivot_query] and [pivot_query_family_city]. From the results in Table [tab:translation_time_normalized], we can also observe that for the same ML model, translation is more time-consuming when features are normalized across tables, but the increase remains within reasonable limits. The reason for this is that the translation algorithm needs to create mappings between feature/weight tables and the columns of the global feature/weight matrix that concentrates all values in a single place. Regarding the lines of code, we observe that there might exist a negligible difference between the automatically generated and handwritten code. This difference is due to the fact that generated code can be more verbose, as we do not support the use of subqueries in SQL code. For example, for the least-squares error formula $`\sum_{i=1}^{N}(\hat{y}_i-y_i)^2`$ the automatically generated version would compute the square of errors and their average in two operations coming from two separate SQL queries,
squared_errors = tf.square(tf.subtract(predictions, targets))
mean_squared_error = tf.reduce_mean(squared_errors, None)
whereas a Python developer would probably prefer to combine these steps in a single line, e.g.
mean_squared_error = tf.reduce_mean(tf.squared_difference(predictions, targets))
taking advantage of the squared_difference operation of the TensorFlow API.
Similar differences might occur in case an operation is provided by
TensorFlow API but not by SQL. An example is the computation of sigmoid
function. TensorFlow provides tf.sigmoid, whereas in SQL the user
needs to write the formula definition. As a result the generated code
for this specific part might involve more operations than its
handwritten version.
Another point where differences in code might occur is data loading. The automatically generated code follows the pattern of reading features and targets from two separate files/data structures, corresponding to the two tables/columns inside the database. A developer though might have other options for loading data. For example, all data might be exported in a single file and he might use Pandas operations to split them to two dataframes before loading them to tensors.
Evaluating feature precomputation
Using the feature reuse process described in section 12.6, we compare the time needed to export training data from the Favorita dataset with and without precomputing and materializing shared features in tables. Table [tab:feature_precomputation] shows time performance on exporting four different sets of features. The first three sets were exported on 80000000 observations. Only the last one was exported on 13870445 observations as holiday features of the Favorita dataset do not apply for most observations. Favorita observations are based on three dimensions: items, stores and dates. There are 4100 items, 54 stores and 1684 dates. When precomputation is used, we compute and store the values of each feature in a table, which we join with every observation during exporting. The naive version of the query computes every feature for every observation (see Listing [naive_query]) ignoring the fact that some features are shared among observations.
| Observations | Features | Time without precomputation | Time with precomputation | Precomputation time |
|---|---|---|---|---|
| 80000000 | 1 | 602.7 | 363.7 | 0.12 |
| 80000000 | 2 | 979.2 | 551.63 | 0.35 |
| 80000000 | 4 | 1331.7 | 630 | 0.475 |
| 13870445 | 8 | 355 | 122 | 0.754 |
Time to export feature sets with and without precomputation
Time performance results indicate that feature precomputation reduced exporting time on the Favorita dataset by about 50% while the time needed to compute and store features in tables remains low.
Correctness of generated code
To check correctness of the generated code, we ran experiments with two ML models on the Boston Housing dataset of Table [tab:datasets] and the epsilon dataset, and compared the loss function between their generated and handwritten versions. The epsilon dataset is a synthetic dataset of 400000 observations with 2000 features that can be used for binary classification, as its observations are annotated with either 1 or -1. We trained a linear regression model on the Boston Housing data that predicts the median value of owner-occupied homes using the rest of the characteristics as features. The learning rate used in gradient descent is 0.000003 and we ran 10000 iterations. The loss function is mean squared error. Figure 8 displays the loss function across iterations from both the handwritten and the generated code of the linear regression model. We observe that both curves follow the same decreasing pattern.
Next we provide results from two logistic regression models on the epsilon dataset. Due to a hard size limit of 2GB in protocol buffers, TensorFlow cannot load the entire epsilon dataset at once. For this reason, in order to train a logistic regression model we need to perform mini-batching on the data, where each gradient descent iteration runs on a single mini-batch instead of the entire dataset.
On the SQL side, the synthetic nature of the epsilon dataset is not well suited to a relational structure. All features are real values with no specific meaning. For our experiment we used a highly denormalized schema with 4 tables, each storing 500 features, and a table for the labels, as displayed below.
features1(rowID: int, f1: double, f2: double ... f500: double)
features2(rowID: int, f501: double, f502: double ... f1000: double)
features3(rowID: int, f1001: double, f1002: double ... f1500: double)
features4(rowID: int, f1501: double, f1502: double ... f2000: double)
labels(rowID: int, v: int)
A query that joins all 4 tables and projects 2000 columns in order to gather all features together ran out of memory on a machine with 16GB. However, because TensorFlow poses a 2GB limit due to protocol buffers anyway, we can load the dataset in two halves using two SQL queries and create mini-batches from the first half at the beginning and from the second half later on. We compare the loss function of the generated TensorFlow code using the above configuration with a handwritten version that loads the entire dataset from a file into a NumPy array, which is then divided into mini-batches on the TensorFlow side. For this experiment the learning rate in gradient descent was set to 0.01. We ran 300 epochs and set the batch size to 200. Each epoch contained 2000 iterations to cover the entire training set. The loss function is logistic loss. Figure 9 displays the loss function across iterations from both versions of the logistic regression model. Due to the stochastic nature of mini-batching, there are small fluctuations between the two plots. However, overall the pattern is very similar.
📊 Paper Figures










