Join Query Optimization with Deep Reinforcement Learning Algorithms

Acknowledgement

We would like to thank Joseph Hellerstein and Zongheng Yang for the fruitful discussions about DRL in query optimization. The advice given by Michael Grossniklaus about query optimization and Paul Bertucci about database tuning has been a great help. We also want to thank Katharina Rombach for the many discussions about reinforcement learning.

Related Work

Query optimization is a well-studied problem in the database community with many different solutions proposed over the last decades. Pioneering work dates back to static query optimization in System R and the Volcano Optimizer Generator, which has been widely used in commercial systems. Later, researchers introduced new architectures for query optimization where queries are continuously optimized and validated during query processing. In 2001, IBM introduced the learning optimizer LEO for DB2, which is based on the architecture of static query optimization and is able to learn from its mistakes.

Lohman states in his blog post that query optimization is still an unsolved problem and points out that most query benchmarks used in research do not reflect real-world databases. Leis et al. picked up on that thought and created a new query benchmark to demonstrate the issues of query optimizers in commercial DBMSes.

Recent progress in machine learning has established new ways to tackle those open problems. For instance, some approaches try to mitigate the issue of cardinality estimation errors by introducing machine learning models to predict the costs or the execution time of queries.

Others apply more modern machine learning approaches leveraging recent advances in reinforcement learning. The main idea of these approaches is to automatically tune a particular database and to improve the performance of query execution by using the feedback from past query executions. Ortiz et al. studied how the state representation affects query optimization when using reinforcement learning for static query optimization. The SkinnerDB system uses reinforcement learning for continuous query optimization. In that approach, queries are dynamically improved on the basis of a regret-bounded quality measure. Most recently, Marcus et al. introduced the end-to-end learning optimizer Neo, which is inspired by AlphaGo. Neo uses a learned cost model based on neural networks (NN) to guide a search algorithm through the large search space of all possible query plans. The approach was motivated by the NN-guided Monte Carlo Tree Search that became famous through AlphaGo Zero.

For our paper, we focus on Q-learning and policy-gradient-based reinforcement learning, which became popular after the publication of the Atari paper by DeepMind. The closest works to ours are ReJoin and DQ. ReJoin uses policy gradient methods to find a good policy for join order enumeration, whereas DQ focuses on a Deep Q-Network architecture to optimize its query plans. Both papers rely on traditional cost models to find optimal query plans, which is suboptimal in the presence of erroneous cardinality estimations. However, DQ introduces a mechanism to fine-tune its Deep Q-Network on the query execution times to mitigate this issue. The advantage of Deep Q-Networks and policy gradient methods over the NN-guided plan search of Neo is the shorter training time.

The key advantage of our approach over all other approaches is that we provide a generalized architecture, which allows different state-of-the-art RL algorithms to be used with minimal effort. FOOP places the query optimizer in the application layer, which makes the approach DBMS-independent. Further, we model query optimization as a fully observed RL problem, which allows the RL algorithms to learn aspects of query execution performance that might be difficult to capture with traditional cost-based optimizers.

Architecture

In this section we introduce FOOP, a Fully Observed Optimizer that uses RL for query optimization. First, we show how to model query optimization as a Markov Decision Process (MDP). Second, we discuss how to represent queries and database information as feature vectors that can be used for deep reinforcement learning. The major research challenge is to find a good feature representation as well as the right reinforcement learning algorithm such that the learning process produces a model that generalizes well and neither gets trapped in local optima nor fails to stabilize.

Modeling

To express a problem as an RL problem, we need to formulate it as an MDP, which consists of the five-tuple mentioned in Equation [eq:mdp] in Section 12.2.1. For the sake of readability, we repeat the specification here:

\begin{equation}
\langle S,A,P(s,a),R(s,a),S_0 \rangle
\end{equation}

We will now walk through each of these components and describe them with concrete examples based on our sample database shown in Figure [fig:sample_db_er] and the query ’$`P \bowtie OI \bowtie O \bowtie C`$’:

  • $`S`$ (states): The states are all possible (sub)query plans. The complete query plans are the terminal states. Since FOOP is fully observed, we show all involved relations at any time. For our sample query, a subset of all states is as follows:
    $`S={ [P;OI;O;C], [(P,OI);O;C], [(OI,P);O;C], ...}`$

    In the example above, every state is represented by square brackets “[]”. For instance, [P;OI;O;C] is the initial state. Relations that are not yet joined are separated by semicolons. The second state is [(P,OI);O;C], where the parentheses “()” indicate a sub query plan.

  • $`A`$ (actions): The actions are all possible joins included in all query plans. Hence, the total action space is larger than $`n!`$, where $`n`$ is the number of relations contained in all queries. For our running example, the action space is as follows:

    $`A = (P \bowtie OI), (OI \bowtie P) , (OI \bowtie O) , (O \bowtie C) ,\\ (C \bowtie O) , (C \bowtie B), ...`$

    In Section 9.2.2 we will discuss a potential simplification to reduce the complexity of the action space.

  • $`S_{t+1} \sim P(s,a)`$ is the new state reached when taking action a in state s. Assume
    $`s = [(P,OI);O;C]`$ and $`a=(C \bowtie O)`$. Then the new state is $`P(s,a) = [(P,OI);(C,O)]`$.

  • $`R(s,a)`$ is the reward for being in state s and taking action a: The rewards are the negative costs of the resulting query plan. The costs are evaluated only in a terminal state. In other words, FOOP only receives the costs for the final query plan and not for sub query plans. We introduce the cost model and the reward handling in Section 9.2.3.

  • $`S_0`$ for initial state: The initial state is: $`[P;OI;O;C]`$.
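
The MDP components above can be sketched in a few lines of Python. This is an illustrative simplification, not FOOP's actual implementation: the frozenset-based state encoding is our own stand-in for the bracket notation, and it deliberately ignores the join order inside a sub plan.

```python
# Sketch of the MDP for the sample query P ⋈ OI ⋈ O ⋈ C.
# A state is a set of sub plans; each sub plan is a set of relation names.

def initial_state(relations):
    # S_0: every relation is its own (trivial) sub plan, e.g. [P;OI;O;C]
    return frozenset(frozenset([r]) for r in relations)

def step(state, action):
    # P(s, a): joining the two sub plans that contain the relations named
    # in `action` merges them into one sub query plan.
    left, right = action
    l = next(p for p in state if left in p)
    r = next(p for p in state if right in p)
    return (state - {l, r}) | {l | r}

def is_terminal(state):
    # A terminal state holds a single, complete query plan.
    return len(state) == 1

s0 = initial_state(["P", "OI", "O", "C"])   # [P;OI;O;C]
s1 = step(s0, ("P", "OI"))                  # [(P,OI);O;C]
s2 = step(s1, ("C", "O"))                   # [(P,OI);(C,O)]
s3 = step(s2, ("P", "C"))                   # single complete plan, terminal
```

Each `step` call corresponds to one transition $`S_{t+1} \sim P(s,a)`$ from the walkthrough above.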

Featurization

RL algorithms need an environment in which they can interact with the MDP, as explained in Section 12.2.1. We will now describe all the components of the agent environment that are necessary to solve a query optimization problem with reinforcement learning.

Observation/State

An observation represents the state in which the agent currently is. Since we want to provide as much information as possible to our learning algorithms, we formulate query optimization as a fully observed RL problem. Hence, the information of the database and the respective queries have to be encoded in such a way that the representation can be learned by the RL algorithms. The encoded observation serves as input for a neural network (NN) in order to learn from the observations.

For the encoding we partly followed the idea of Krishnan et al., where each column of the database is used as a single feature. A state is represented as a binary vector in which every digit represents a column of the database. The size of the vector corresponds to the total number of columns over all tables. This vector is used twice (see the left example in Figure 1): the first vector represents the original query and marks which columns are involved in the query. Since the query involves all tables, all columns are set to 1. The second vector represents the state of the sub query plan. In the initial state $`S_{0}`$ all columns are set to zero.

The next step is to perform an action. In this case, action $`A_{0}`$ suggests to execute the sub query $`C \bowtie O`$. As a consequence, state $`S_{1}`$ needs to be updated. In particular, all columns that are involved in this join need to be set to 1 in the second vector. Afterwards action $`A_{1}`$ is executed and state $`S_{2}`$ needs to be updated, etc.

Our proposed approach FOOP extends that idea, as shown on the right side of Figure 1. Instead of a single vector that represents the currently joined sub query, we create a matrix in which each row represents a table or a sub query. All rows of the matrix together comprise the tables needed for the final query. With this matrix, we represent the whole database during the process of query execution.

In the initial state $`S_0`$, the matrix represents each table needed for the query in a separate row. For instance, row 0 has the first three bits set to represent table P, row 1 has bits 4 to 6 set to represent table OI, etc.

Then action $`A_{0}`$ performs the sub query $`C \bowtie O`$, which involves the tables O and C that are represented by the vector [3 2].

Next, state $`S_{1}`$ needs to be updated. Rows 0 and 1 still represent the tables $`P`$ and $`OI`$, i.e. the respective columns are set to 1, while row 3 now contains the result of the join $`(C \bowtie O)`$. Since the input vectors of NNs must always have the same size, we are forced to keep row 2 as an empty row.

In every further step we join another table into the sub query, until only the final query plan is left, as presented in the last step $`S_3`$.
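
A minimal NumPy sketch of this observation matrix follows. The per-table column counts are illustrative assumptions (the text implies three columns for P and OI; we assume three for all tables), and `join_rows` is our own stand-in for FOOP's state update.

```python
import numpy as np

# Assumed schema: table name -> number of columns (illustration only).
SCHEMA = {"P": 3, "OI": 3, "O": 3, "C": 3}
TABLES = list(SCHEMA)
TOTAL_COLS = sum(SCHEMA.values())
OFFSET, _off = {}, 0
for t in TABLES:
    OFFSET[t], _off = _off, _off + SCHEMA[t]

def initial_observation(query_tables):
    # S_0: one row per table; the bits of a row mark that table's columns.
    obs = np.zeros((len(TABLES), TOTAL_COLS), dtype=np.int8)
    for i, t in enumerate(query_tables):
        obs[i, OFFSET[t]:OFFSET[t] + SCHEMA[t]] = 1
    return obs

def join_rows(obs, i, j):
    # Joining two sub plans ORs their column bits into row j and empties
    # row i, so the matrix keeps a fixed shape (the NN input size must
    # not change during an episode).
    obs = obs.copy()
    obs[j] = obs[i] | obs[j]
    obs[i] = 0
    return obs

obs0 = initial_observation(["P", "OI", "O", "C"])   # state S_0
obs1 = join_rows(obs0, 2, 3)                        # C ⋈ O: row 2 becomes empty
```

After the join, row 3 carries the bits of both O and C while row 2 stays as the empty placeholder row described above.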

Observation featurization in DQ by Krishnan et al. and in FOOP, presented on the sample query ’P ⋈ OI ⋈ O ⋈ C’ using the sample database shown in Figure [fig:sample_db_er].

The advantage of our new featurization approach is as follows: the reinforcement learning agent knows at any state which sub queries still have to be joined to reach a terminal state. In addition, the agent also sees which sub queries have already been joined.

Actions

We will now discuss how agents take actions and provide details on how we construct the action space. To create the action space, we return to our observation matrix. Every row of the observation matrix is a join candidate, as you can see in state $`S_0`$ of Figure 2. To construct the action space, we take all combinations of join candidates for joining two tables.

This results in an action space of size $`n*(n-1)`$, where $`n`$ is the total number of tables in the database (as shown in the lower half of Figure 2). For instance, row 0 represents the join $`P \bowtie OI`$, row 1 represents the join $`P \bowtie O`$, etc. In our example, we assume that row 11 (which is highlighted in yellow) is selected by a random agent.

Action featurization in FOOP

As mentioned in Section 9.2.1, the observation matrix contains empty rows during the query optimization process. This means that not all rows of the observation matrix are valid join candidates. As a consequence, there are invalid actions (highlighted in light gray) in our action space, as you can see in the lower right part of Figure 2. All these actions are invalid since they would join row 2, which is an empty row and thus not a join candidate.

Having invalid actions is a common issue in DRL. We solve it by adding an action masking layer to the NN, similar to the approach of Lee et al. for autonomously playing StarCraft II. The basic idea is that the output layer of the NN is multiplied with an action mask, resulting in a new output where the values of the invalid actions are set to zero. The action mask has the size of the full action space (valid actions are represented by a 1 and invalid actions by a 0).
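
The masking idea can be sketched as follows. In FOOP the mask is applied as a layer inside the NN; here, as a simplification, we multiply a stand-in output vector by the mask outside any network.

```python
import numpy as np

def build_action_space(n):
    # All ordered pairs of rows: n*(n-1) actions for n join candidates.
    return [(i, j) for i in range(n) for j in range(n) if i != j]

def action_mask(actions, empty_rows):
    # 1 for valid actions, 0 for actions touching an empty (non-candidate) row.
    return np.array([0 if (i in empty_rows or j in empty_rows) else 1
                     for (i, j) in actions], dtype=np.float32)

actions = build_action_space(4)              # 12 actions for 4 rows
mask = action_mask(actions, empty_rows={2})  # row 2 is currently empty
# Stand-in for the NN output layer; multiplying zeroes out invalid actions.
masked_output = np.random.rand(len(actions)) * mask
```

Any action involving row 2 now has value zero and can never be chosen by a greedy action selection.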

Reward

We will now discuss how agents receive a reward when executing an action. A reward is defined as the negative cost of a query plan. Unfortunately, it would be too time-consuming to execute every planned query on a DBMS. Therefore, we fall back on the cost models of traditional query optimizers. Even though these cost models are sometimes erroneous, they serve as good priors for training a machine learning model. We will now introduce the cost model and show how we integrate it into the agent environment.

  • Cost Model: We decided to take the cost model, which was introduced in :

    \begin{equation}
    C(Q) =
    \begin{cases}
    \tau \cdot |R| & \text{if } Q = R \\
    |Q| + C(Q_l) + C(Q_r) & \text{if } Q = Q_l \bowtie^{hj} Q_r \\
    C(Q_l) + \lambda \cdot |Q_l| \cdot \max\left(\frac{|Q_l \bowtie R|}{|Q_l|}, 1\right) & \text{if } Q = Q_l \bowtie^{ij} Q_r,\; Q_r = R
    \end{cases}
    \label{eq:costmodel}
    \end{equation}
    

    In the Equation [eq:costmodel] above,

    • R stands for a base relation.

    • Q stands for a (sub)query, where $`Q_l`$ is the left side of a join and $`Q_r`$ the right side.

    • $`\tau \leq 1`$ is a parameter, which discounts a table scan compared to a join.

    • $`\lambda \geq 1`$ is a constant to approximate by how much an index lookup is more expensive than a hash table lookup.

    • $`|\cdot|`$ stands for the cardinality estimation function.

    • $`hj`$ and $`ij`$ stand for hash join and index nested loop join, respectively.

    The cost model is tailored to main-memory databases: it only measures the number of tuples that pass through each operator and does not take I/O costs into account. Further, it only distinguishes between hash joins (hj) and index nested loop joins (ij). This cost model is very simplistic compared to the cost models of commercial DBMSes. Nevertheless, it performs very similarly to the cost model of PostgreSQL, as Leis et al. pointed out. We set the constants according to the paper ($`\tau=0.2`$, $`\lambda=2`$).

  • Reward Mapping: As mentioned above, the negative costs serve as the reward for our RL model. Applied to a large database, the cost model produces cost values in the range of $`10^6`$ to $`10^{18}`$, whereas DRL methods usually operate with reward values in the range (-10,10). Hence, we need to normalize the cost values to a range usable by the RL algorithms. We use the square root function shown in Equation [eq:rewardmapping] to normalize the cost values. Linear normalization did not work well due to the distribution of the cost values.

    \begin{equation}
    N(Q) = -10 \cdot \frac{\sqrt{C(Q)}}{\sqrt{\mathit{upperbound}}}
    \label{eq:rewardmapping}
    \end{equation}
    

    Since good query plans are just a small fraction of the whole space of all possible query plans, we clip our reward space. Figure 3 shows the mapping of the cost values to the reward. The square root function is applied to all cost values lower than $`10^{13}`$, with $`10^{13}`$ as the upper bound. Cost values greater than $`10^{13}`$ are clipped to the minimum reward of $`-10`$.

    Mapping of the costs values to the reward scale (-10,0).
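
Cost model and reward mapping together can be sketched as below. The nested-tuple plan encoding and the `card` callback are our own stand-ins: in FOOP, cardinalities come from PostgreSQL's estimator, and the constants follow the paper ($`\tau=0.2`$, $`\lambda=2`$).

```python
import math

TAU, LAMBDA = 0.2, 2.0        # constants from the cost model
UPPER_BOUND = 1e13            # clipping threshold for the reward mapping

def cost(plan, card):
    # Plans are base-relation names or tuples ("hj" | "ij", left, right);
    # `card` returns the estimated cardinality of a (sub)plan or relation.
    if isinstance(plan, str):                 # Q = R: base relation scan
        return TAU * card(plan)
    op, left, right = plan
    if op == "hj":                            # hash join
        return card(plan) + cost(left, card) + cost(right, card)
    # index nested loop join; the right side must be a base relation R
    return cost(left, card) + LAMBDA * card(left) * max(card(plan) / card(left), 1.0)

def reward(plan, card):
    # Costs above the upper bound are clipped to the minimum reward of -10.
    c = min(cost(plan, card), UPPER_BOUND)
    return -10.0 * math.sqrt(c) / math.sqrt(UPPER_BOUND)
```

A plan whose cost reaches the upper bound maps exactly to the minimum reward of -10, while cheaper plans map smoothly into (-10, 0).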

Evaluation

In this section, we present and discuss the results of the experiments using various reinforcement learning techniques for query optimization. In particular, we will address the following research questions:

  • How effective are Q-learning and policy gradient-based reinforcement learning approaches for query optimization?

  • Which of these approaches shows the best performance?

  • Can we even further improve the above mentioned approaches by applying ensemble learning techniques?

Experimental Setup

All experiments were performed on a laptop running Ubuntu version 18.04.1 with a 4-core Intel Core i7-8650U CPU (1.90-4.20 GHz), 16 GB of RAM and no GPU.

We implemented our approach1 as an extension of gym from OpenAI in Python. This allows us to use the RL interface from OpenAI and the baseline RL algorithms provided by Ray RLlib. The NN models are written with TensorFlow. Our query optimizer is located in the application layer, so we are independent of the DBMS. To calculate the expected costs as explained in Section 9.2.3, we use the cardinality estimator of PostgreSQL 10.6. Furthermore, for our end-to-end performance experiment we use PostgreSQL 10.6 as the query execution engine.

For the evaluation we use the Join Order Benchmark (JOB) introduced by Leis et al. The benchmark consists of queries with 3 to 16 joins, with an average of 8 joins per query. There are 33 different query structures. Each structure exists in 2 to 6 different variants, which results in a query set of 113 queries. All queries are realistic in the sense that they answer questions a movie enthusiast might ask and are not constructed to ’trick’ a query optimizer.

JOB is based on the IMDB data set, which is freely available. The IMDB is a real-world data set consisting of 21 tables. It contains information about movies and movie-related facts like actors, directors etc. We use the same snapshot from May 2013 as Leis et al. do. The data set comprises 3.6 GB of highly correlated and non-uniformly distributed data. The largest tables are $`cast\_info`$ with 36 million rows and $`movie\_info`$ with 15 million rows.

If not specified otherwise, all results are reported on all 113 JOB queries. To ensure that performance is evaluated only on queries which the model has not seen during training, we use the 4-fold cross validation introduced by Krishnan et al. Each of the 113 queries occurs in at least one test set. The training and test sets consist of 80 and 33 queries, respectively.

Evaluation of RL algorithms

In the first part of the evaluation we will analyze the performance of different RL algorithms. We will start with deep Q-networks (DQN) and discuss some enhancements. Afterwards we will analyze proximal policy optimization (PPO).

Deep-Q-Networks (DQN)

Let us start with the evaluation of the DQNs, which we introduced in Section 12.2.2. We begin with the vanilla DQN, which was used in the DQ paper and that we have implemented and configured according to the information provided in the paper.

The most important hyper-parameter values of the vanilla DQN can be found in Table [tbl:configVanillaDQN]. The full set of parameter values can be found on our github repository2. The vanilla DQN is trained over 4000 iterations. The learning starts after time step 1000 and the target network gets updated every 500 time steps. For this model we use 2-step Q-learning and a neural network with two hidden layers with 256 neurons each.

As introduced in Section 9.2.3, our DQN uses the expected costs calculated by the cost model in Equation [eq:costmodel] as the negative reward for each query. We only give the DQN a reward signal after planning the full query, to achieve a stable learning process and to give the DQN a chance to learn.

Parameter Value
Training iterations: 4000
Learning starts (in time steps): 1000
Target network update (in time steps): 500
n-step Q-learning: 2
Hidden layers: [256 256]
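
The table above can be written down as a Ray RLlib-style configuration sketch. The key names follow RLlib's DQN trainer configuration, but exact names can differ between RLlib versions, so treat them as an assumption rather than a verbatim copy of our setup.

```python
# Sketch of the vanilla DQN configuration from the table above.
vanilla_dqn_config = {
    "learning_starts": 1000,             # time step at which learning begins
    "target_network_update_freq": 500,   # target network update interval
    "n_step": 2,                         # 2-step Q-learning
    "model": {"fcnet_hiddens": [256, 256]},  # two hidden layers, 256 neurons each
}
TRAINING_ITERATIONS = 4000               # training iterations for this model
```

The full set of parameter values is in our github repository, as noted above.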

The results on the left side of Figure 4 show the estimated cost values when training on 80 queries and testing on 33 queries. The cost value is computed based on the cost model introduced in Section 9.2.3, see Equation [eq:costmodel].

We can see that the average cost value of the deep Q-network (DQN) is around 0.2*1e10 (see left side of Figure 4). However, the minimum and maximum values range between 0.1*1e10 and 3.5*1e10, which indicates a high variance in the estimated query costs, resulting in expensive query plans for about half of the observed queries.

In order to reduce the high variance of the estimated costs and to reach a more stable optimization solution, we extended the vanilla DQN model with double DQN and priority replay. In addition, we used a larger neural network (NN) to achieve a higher abstraction level for finding an optimal Q-function. The configuration of the DDQN is listed in Table [tbl:doubleDeepQ]. The DDQN is trained over 40,000 iterations. The learning starts after time step 160,000 and the target network gets updated every 32,000 time steps. For this model we use 2-step Q-learning as well, and a neural network with two hidden layers of 6,272 and 1,568 neurons, respectively.

The results of the double deep-Q-network (DDQN) with priority replay are shown on the right side in Figure 4.

Parameter Value
Training iterations: 40,000
Learning starts (in time steps): 160,000
Target network update (in time steps): 32,000
n-step Q-learning: 2
Hidden layers: [6272 1568]

Cost values for optimizing 113 queries using reinforcement learning algorithms. The figure compares a vanilla deep-Q-network (DQN) with a double deep-Q-network (DDQN) using priority replay.

Figure 4 shows the cost spread of all 113 queries. Even though the median cost values (green line) of DQN and DDQN are very similar, the inter-quartile range (blue box) of the vanilla DQN is larger than that of the DDQN by a factor of two. In addition, the maximum cost value of the vanilla DQN is twice as large as the maximum of the DDQN. The minimum values of the vanilla DQN and DDQN are similar. In short, the Q-function of the DDQN produces far less expensive query plans than the vanilla DQN.

Proximal Policy Optimization (PPO)

In the second step we evaluate proximal policy optimization (PPO), which is used in ReJoin. We implemented that approach using the reinforcement learning algorithms provided by RLlib of Ray3. The basics of PPO are introduced in Section 12.2.3. Unfortunately, the ReJoin paper does not provide the configuration of the used PPO model. Therefore, we created and tuned our own configuration, which can be found in Table [tbl:PPO]. The full set of parameter values can be found in our github repository4. The PPO model is trained over 200,000 iterations. During the training process we clip policy updates with a deviation larger than 0.3. Furthermore, we use a neural network with two hidden layers of 256 neurons each.

Parameter Value
Training iterations: 200,000
Clipping coefficient ($`\epsilon`$): 0.3
Hidden layers: [256 256]

PPO outperforms both DQN configurations, as presented on the left side of Figure 5. PPO is able to reduce the inter-quartile range and the maximum by more than a factor of two compared to DDQN.

Comparing three different DRL algorithms in FOOP a) on a random training/test split, b) after adapting the training and test data sets and c) with ensemble learning

Enhance RL models

In this section we evaluate how we can improve the DRL models. First, we discuss how the split of training and test data affects the learning process. Afterwards we introduce ensemble learning to improve the DRL-based query optimizer.

Re-Arranging Training and Test Data Sets

In the previously presented experiments, we used a random data split for generating training and test sets. As Krishnan et al. pointed out, it is important that the training queries cover all relations and use all important relationships in the database. We therefore introduce a new training/test data split, which can be found in our github repository5. The key requirements for the data split are:

  • All relations have to be represented in every training set.

  • The training sets have to contain all possible join conditions.

  • To have accurate test conditions, every test set has to contain the largest possible variety of different join queries.

We ran our experiments with the new data split for the training and test sets. The results are shown in the middle of Figure 5. As we can see, the new arrangement of the training and test data improves the query optimization performance of all models. Especially the inter-quartile ranges and the maxima of DQN and DDQN are reduced significantly. Nevertheless, PPO still outperforms the DQN models. The better performance can be explained by the new training sets, which cover all relations and join conditions of the database. This enables the DRL-based optimizer to reduce overfitting of the Q-function and policy function.

Ensemble Learning

In RL, training is a stochastic process. This means that the policies learned in different training runs can diverge, resulting in different optimal query plans for the same query. Ensemble learning is a method that helps us turn this issue into a benefit. Ensemble learning is often used to combine multiple learning algorithms with the aim of creating better predictions than a single learning algorithm. In other words, the agent asks for a second or third opinion before taking a decision.

To apply ensemble learning, we first train every DRL model five times and thus obtain five different learned policies. Afterwards we use all five learned policies to optimize a given query. As a result, we receive five different “optimal” query plans. Finally, we choose the query plan with the lowest costs.
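
The ensemble step reduces to a one-line selection once the policies are trained. In this sketch, `policies` and `estimate_cost` are placeholders for the five trained models and the cost model from Section 9.2.3.

```python
# Ensemble selection: each trained policy proposes a plan for the query,
# and the plan with the lowest estimated cost wins.

def ensemble_optimize(query, policies, estimate_cost):
    candidate_plans = [policy(query) for policy in policies]
    return min(candidate_plans, key=estimate_cost)
```

Since only the cheapest of the five candidate plans is kept, the ensemble can never be worse than the best individual policy on a given query (with respect to the cost model).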

The results of using the three different DRL-based query optimizers with ensemble learning are shown on the right side of Figure 5. We are able to reduce the inter-quartile range and the maximum of all models significantly. In addition, we were even able to lower the median of all three models. It is not surprising that we get better results, since we take just the best query plans from five different learned policies. Moreover, we could also reduce the number of outliers.

Comparison to other Approaches

So far we have compared the DRL algorithms introduced in ReJoin and DQ. However, for DQ we used the featurization of FOOP. The advantage of FOOP’s featurization is that the learning algorithms have more information about the query execution process, resulting in a smaller action space.

Now we will compare these approaches to a traditional query optimization approach based on dynamic programming to better understand the advantages and disadvantages of DRL-based approaches.

Dynamic Programming with Left-Deep Plans

DP algorithms were introduced to query optimization in System R and are still widely used in commercial DBMSes. The basic assumption of these algorithms is that the optimal sub query plan is part of the optimal query plan. The algorithm therefore compares sub queries which have the same outcome and picks the sub query with the lowest costs to proceed. However, this process is very memory- and compute-intensive, especially for large queries.

Typical approaches use a recursive process whose runtime grows with a lower bound of $`n!`$, where $`n`$ represents the number of relations in the query. To reduce that complexity, System R introduced rules like ’left-deep query trees only’ and ’no cross joins’. We implemented the bottom-up DP algorithm from System R with the restriction to left-deep trees. The algorithm can be found in our github repository. Note that this approach is still very compute-intensive. Hence, DBMSes like PostgreSQL limit the size of queries processed by dynamic programming (e.g. 8-way joins). When queries contain more than 8 relations to join, the optimizer greedily selects the remaining relations or uses genetic algorithms for optimization.
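
The following is a hedged sketch of bottom-up DP restricted to left-deep trees, not the implementation from our repository. `cost_of` is a stand-in for the cost model, join-predicate checks (the ’no cross joins’ rule) are omitted for brevity, and real implementations additionally track physical operators.

```python
import itertools

def dp_left_deep(relations, cost_of):
    # best[subset] = (cost, plan); plans are nested tuples (left_plan, relation),
    # so every intermediate result is a left-deep tree.
    best = {frozenset([r]): (cost_of(r), r) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, itertools.combinations(relations, size)):
            for r in subset:                    # right side: one base relation
                rest = subset - {r}
                if rest not in best:
                    continue
                sub_cost, sub_plan = best[rest]
                plan = (sub_plan, r)
                c = sub_cost + cost_of(plan)
                if subset not in best or c < best[subset][0]:
                    best[subset] = (c, plan)    # keep cheapest plan per subset
    return best[frozenset(relations)]
```

Because the table keeps only the cheapest plan per subset of relations, the enumeration stays exponential in the number of relations rather than factorial, which is exactly the saving DP buys over naive enumeration.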

In Figure 6 we compare the three previously discussed DRL-based optimization strategies against dynamic programming (DP). As we can see, DP outperforms the DRL-based approaches in plan quality. However, the optimization runtime of, e.g., PPO is much better than that of dynamic programming, as we will show in Section 10.5.

Comparison of FOOP to DP left-deep
Query-by-query comparison of FOOP with PPO and DP left-deep

Let us now take a closer look at PPO compared to dynamic programming. In particular, we analyze the join complexity of the queries. The x-axis of Figure 7 shows the queries sorted by number of joins, ranging from 3 to 16. We can see that for a large number of queries PPO even outperforms dynamic programming. However, PPO has about 6 outliers that decrease the overall performance. We will now analyze these outliers in more detail.

Outlier Analysis

During the analysis of the outliers we observed that all 6 outlier queries come from two different query templates of the JOB benchmark. In particular, the outlier queries belong to the templates 25 and 31. These outliers can be explained by the occurrence of the different tables in the training set versus the test set. Figure 8 shows the occurrence of the tables as a heatmap. The color red indicates that the respective table is heavily used, i.e. it is “hot”, while the color blue indicates that the respective table is hardly used, i.e. it is “cold”. We can observe that for outlier queries there is a higher number of “hot” tables in the test data, while the respective tables in the training data tend to be “colder”. In other words, the training data does not contain enough examples involving these tables, and hence these queries could not be learned well.

The heatmap shows how often particular tables are accessed by the queries in the training set, the test set and the outliers.

Training Time

In contrast to traditional query optimizers, DRL-based query optimizers initially need to be trained before they can be used. In this section we describe the training latency of the different DRL-models used in FOOP. The training process of vanilla DQN over 4,000 time steps and DDQN over 40,000 time steps takes about 7 minutes each. The PPO model is trained over 200,000 time steps and takes roughly 25 minutes.

The training time in FOOP with ensemble learning increases by a factor of five, since we train each model five times. The DQN models take 45 minutes and the PPO model even needs 125 minutes to complete the training process. In comparison, ReJoin needs 3 hours to complete the training process on a slightly smaller hardware configuration.

One of the biggest advantages of query optimization with DRL is that the optimization latency is linear in the number of relations of a query, i.e. it has a complexity of $`O(n)`$. The latency grows linearly with the join size, even if we use ensemble learning to improve our query plans, as presented in Figure 9. Left-deep DP, on the other hand, has an optimization latency that is factorial in the number of relations of the query, corresponding to a complexity of $`O(n!)`$.

Comparing the optimization latency of FOOP with left-deep DP.

Introduction

Query optimization has been studied for decades and is one of the key aspects of database management systems (DBMS). However, one of the biggest challenges in query optimization is join order optimization, i.e. finding the optimal order of executing joins such that the query costs and hence the query execution time are minimized. Since join order optimization belongs to the class of NP-hard problems, exhaustive query plan enumeration is prohibitive for large databases with multi-way join queries. Hence, query optimization is often considered a trade-off between the quality of a query plan and the time spent optimizing it.

Query optimization strategies of many commercial DBMSes are based on ideas introduced in System R or in the Volcano Optimizer Generator. These systems use dynamic programming (DP) combined with a set of rules to find good query plans. These rules prune the search space of potential query plans, which reduces the time spent optimizing the query but also lowers the chance that the optimal query plan is found in the large search space. Traditional query optimizers suffer from a second issue besides the limitation of the search strategies: they rely on cost models to estimate the cost of executing a query. These cost models are built on cardinality estimations which are based on quantile statistics, frequency analysis or even non-theoretically grounded methods. Errors in the cardinality estimation often lead to suboptimal query plans. Moreover, traditional query optimizers do not learn from previously executed queries. Even though concepts of learning optimizers have been around since LEO, these approaches have not been widely adopted. As the work of Leis et al. shows, there is a need for data-adaptive learning query optimizers for large analytical databases.

Recent success in deep reinforcement learning (DRL) has brought new opportunities to the field of query optimization. For example, ReJoin and DQ propose using DRL to optimize join queries. Both papers apply different DRL algorithms in their query optimizers. However, there is no generic query optimization framework that allows studying different machine learning algorithms and hence enables a direct comparison between different methods.

In this paper we introduce a DRL-based Fully Observed Optimizer (FOOP) for databases. FOOP allows reinforcement learning (RL) algorithms to track all relations and intermediate results during the query optimization process, similar to the observations during a game of Go . FOOP is inspired by ReJoin and DQ and enhances them with the most recent modules introduced in DRL research, such as Double Deep Q-Networks and Priority Replay. We show that FOOP produces query plans of similar quality to traditional query optimizers while using significantly less time for optimization.
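The "fully observed" framing can be sketched as a gym-style environment: the observation exposes the cardinalities of all base relations and intermediate results, and each action joins two of them. All class and method names below are our own illustration under a toy cost model, not FOOP's actual interface.

```python
import random

class JoinOrderEnv:
    """Minimal gym-style sketch of a fully observed join-ordering task."""

    def __init__(self, cards, sel=0.1):
        self.base = dict(cards)   # {relation: estimated cardinality}
        self.sel = sel            # assumed uniform join selectivity
        self.reset()

    def reset(self):
        # Observation: cardinality of every joinable item (base
        # relations at first, intermediate results as joins happen).
        self.items = list(self.base.values())
        self.total_cost = 0.0
        return tuple(self.items)

    def step(self, action):
        i, j = action  # indices of the two inputs to join
        out = self.items[i] * self.items[j] * self.sel
        self.total_cost += out
        self.items = [c for k, c in enumerate(self.items) if k not in (i, j)]
        self.items.append(out)              # intermediate result stays visible
        done = len(self.items) == 1         # one result left: plan complete
        return tuple(self.items), -out, done  # reward = negative join cost

env = JoinOrderEnv({"A": 1000, "B": 100, "C": 10})
obs, done = env.reset(), False
while not done:
    # a random policy stands in for the learned DQN/PPO agent
    i, j = random.sample(range(len(obs)), 2)
    obs, reward, done = env.step((i, j))
print(env.total_cost)
```

Because intermediate results remain part of the observation, the agent always sees the full state of the partially built plan, unlike formulations that only expose the remaining base relations.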

The paper makes the following contributions:

  • With FOOP we introduce query optimization as a fully observable RL problem, which allows RL algorithms to track all relations and intermediate results during the optimization process. We place FOOP in the application layer, which makes the optimizer DBMS-independent.

  • To the best of our knowledge, this paper presents the first face-to-face comparison of the DRL algorithms vanilla Deep Q-Network, Double Deep Q-Network with Priority Replay, and Proximal Policy Optimization in query optimization.

  • FOOP produces query plans with cost estimates in a similar range to traditional query optimizers while using optimization algorithms with significantly lower runtime complexity.

The paper is organized as follows. Section 7 reviews the literature in the area of query optimization. Section 12 provides the background knowledge about query optimization and the basics of reinforcement learning. Section 9 introduces the architecture and the featurization of our learning optimizer FOOP. Section 10 provides a detailed evaluation of the experiments with FOOP. Finally, we present our conclusions and future work in Section 11.

Conclusion and Limitation

In this paper we analyzed various deep reinforcement learning algorithms for query optimization and introduced FOOP - a deep reinforcement learning-based Fully Observed Optimizer. FOOP is currently limited to join order optimization but can easily be extended to optimize select-project-join queries or even non-relational operators as presented in DQ . FOOP is implemented in an RL-friendly environment which enables experimenting with the most cutting-edge RL algorithms.

Our experimental evaluation shows that the Proximal Policy Optimization (PPO) reinforcement learning algorithm outperforms Deep Q-Networks (DQN) in the task of join order optimization. We also suggest using ensemble learning in combination with DRL to mitigate the stability issues of the stochastic training process. The results demonstrate that ensemble learning significantly improves the overall performance of the RL algorithms. FOOP produces query plans with cost estimates in a similar range to traditional query optimizers based on left-deep dynamic programming algorithms. However, the optimization latency of FOOP is significantly lower than that of left-deep DP algorithms.
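The ensemble step can be sketched in a few lines. This is a hypothetical illustration of the idea, not FOOP's code: several independently trained agents each propose a plan, and the shared cost model picks the cheapest, so a single unlucky training run cannot dominate the output.

```python
def cheapest_plan(agents, query, cost_model):
    # Each independently trained agent proposes one plan for the query;
    # the shared cost model selects the cheapest proposal.
    plans = [agent(query) for agent in agents]
    return min(plans, key=cost_model)

# Toy stand-ins for trained policies and for the optimizer's cost model.
agents = [
    lambda q: ("A", "B", "C"),   # plan proposed by agent 1
    lambda q: ("B", "C", "A"),   # plan proposed by agent 2
]
cost = {("A", "B", "C"): 20000.0, ("B", "C", "A"): 10100.0}
print(cheapest_plan(agents, "q", cost.get))  # ('B', 'C', 'A')
```

Since each agent plans in $`O(n)`$ steps, running a fixed-size ensemble of $`k`$ agents keeps the overall optimization latency linear in the number of relations.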

Based on our experiments, we see the following avenues of further research:

  • One of the disadvantages of deep reinforcement learning is that pre-training certain models is very compute-intensive. One solution to overcome this problem is to integrate bootstrapped expert knowledge into the pre-training phase for faster learning.

  • Some of the cost models of traditional query optimizers suffer from erroneous cardinality estimations. To solve this problem, one could add a fine-tuning step to the training process that reuses cardinality data from already executed queries.

  • In order to provide an end-to-end query optimizer, FOOP needs to be extended with further functions such as selections, projections, and aggregations.