Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning Approach



Yaoshu Wang (Shenzhen Institute of Computing Sciences, Shenzhen University) yaoshuw@sics.ac.cn
Chuan Xiao (Osaka University & Nagoya University) chuanx@ist.osaka-u.ac.jp
Jianbin Qin* (Shenzhen Institute of Computing Sciences, Shenzhen University) jqin@sics.ac.cn
Xin Cao (The University of New South Wales) xin.cao@unsw.edu.au
Yifang Sun (The University of New South Wales) yifangs@cse.unsw.edu.au
Wei Wang (The University of New South Wales) weiw@cse.unsw.edu.au
Makoto Onizuka (Osaka University) onizuka@ist.osaka-u.ac.jp
* Corresponding author.

ABSTRACT
Due to the outstanding capability of capturing underlying data distributions, deep learning techniques have recently been utilized for a series of traditional database problems. In this paper, we investigate the possibilities of utilizing deep learning for cardinality estimation of similarity selection. Answering this problem accurately and efficiently is essential to many data management applications, especially query optimization. Moreover, in some applications the estimated cardinality is supposed to be consistent and interpretable; hence a monotonic estimation w.r.t. the query threshold is preferred. We propose a novel and generic method that can be applied to any data type and distance function. Our method consists of a feature extraction model and a regression model. The feature extraction model transforms original data and thresholds to a Hamming space, in which a deep learning-based regression model is utilized to exploit the incremental property of cardinality w.r.t. the threshold, for both accuracy and monotonicity. We develop a training strategy tailored to our model as well as techniques for fast estimation. We also discuss how to handle updates. We demonstrate the accuracy and the efficiency of our method through experiments, and show how it improves the performance of a query optimizer.

CCS CONCEPTS
• Information systems → Query optimization; Entity resolution; • Computing methodologies → Neural networks.

KEYWORDS
cardinality estimation; similarity selection; machine learning for data management

1 INTRODUCTION
Deep learning has recently been utilized to deal with traditional database problems, such as indexing [43], query execution [23, 42, 61, 70], and database tuning [81]. Compared to traditional database methods and non-deep-learning models (logistic regression, random forest, etc.), deep learning exhibits an outstanding capability of reflecting the underlying patterns and correlations of data, as well as exceptions and outliers that capture the extreme anomalies of data instances [42]. In this paper, we explore the direction of applying deep learning techniques to a data management problem: cardinality estimation of similarity selection, i.e., given a set of records D, a query record 𝑥, a distance function, and a threshold 𝜃, to estimate the number of records in D whose distances to 𝑥 are no greater than 𝜃. It is an essential procedure in many data management tasks, such as search and retrieval, data integration, data exploration, and query optimization. For example: (1) In image retrieval, images are converted to binary vectors (e.g., by a HashNet [15]), and then the vectors whose Hamming distances to the query are within a threshold of 16 are identified as candidates [82] for further image-level verification (e.g., by a CNN).
Since the image-level verification is costly, estimating the cardinality of the similarity selection yields the number of candidates, and thus helps estimate the overall running time in an end-to-end system to create a service level agreement. (2) In query optimization, estimating cardinalities for similarity selection benefits the computation of operation costs and the choice of execution orders of query plans that involve multiple similarity predicates; e.g., hands-off entity matching systems [20, 28] extract paths from random forests and take each path (a conjunction of similarity predicates over multiple attributes) as a blocking rule. Such queries were also studied in [49] for sets and strings.

[Figure 1: Cardinality distribution on ImageNet. (a) Cardinality vs. threshold. (b) Percentage of queries vs. cardinality.]

The reason why deep learning approaches may outperform other options for cardinality estimation of similarity selection can be seen from the following example: (1) Figure 1(a) shows the cardinalities of five randomly chosen queries on the ImageNet dataset [1] with varying Hamming distance thresholds. The cardinalities remain unchanged at some thresholds but surge at others. (2) Figure 1(b) shows the percentage of queries (out of 30,000) for each cardinality value, under Hamming distance thresholds 4, 8, 12, and 16. The cardinalities are small or moderate for most queries, yet exceptionally large for long-tail queries (on the right side of the figure). Both facts cause considerable difficulties for traditional database methods, which require large samples to achieve good accuracy, and traditional learning methods, which are incapable of learning such complex underlying distributions. In contrast, deep learning is a good candidate to capture such data patterns and generalizes well on queries that are not covered by the training data, thereby delivering better accuracy. Another reason for choosing deep learning is that the training data, though large training sets are usually needed for deep learning, are easily acquired by running similarity selection algorithms (without producing label noise when exact algorithms are used).

In addition to accuracy, there are several other technical issues for cardinality estimation of similarity selection: (1) A good estimation is supposed to be fast. (2) A generic method that applies to a variety of data types and distance functions is preferred. (3) Users may want the estimated cardinality to be consistent and interpretable in applications like data exploration. Since the actual cardinality is monotonically increasing with the threshold, when a greater threshold is given, a larger or equal number of results is preferable, so the user is able to interpret the cardinality for better analysis. To cope with these technical issues, we propose a novel method that separates data modelling and cardinality estimation into two components:
• A feature extraction component transforms original data and thresholds to a Hamming space such that the semantics of the input distance function is exactly or approximately captured by Hamming distance. As such, our method becomes generic and applies to any data type and distance.
• A regression component models the estimation as a regression problem and estimates the cardinality on the transformed vectors and threshold using deep learning.

To achieve good accuracy of regression, rather than feeding a deep neural network with training data in a straightforward manner, we devise a novel approach based on incremental prediction to exploit the incremental property of cardinality; i.e., when the threshold increases, the increment of cardinality is only caused by the records in the increased range of distance. Since our feature extraction maps original distances to discrete distance values, we can use multiple regressors, each dealing with one distance value, and then sum up the individual results to get the total cardinality. In doing so, we are able to learn the cardinality distribution for each distance value, so the overall estimation becomes more accurate. Another benefit of incremental prediction is that it guarantees monotonicity w.r.t. the threshold, and thus yields more interpretability of the estimated results. To estimate the cardinality of each distance value, we utilize an encoder-decoder model through careful neural network design: (1) To cope with the sparsity in the Hamming space output by the feature extraction, we employ a variational auto-encoder to embed the binary vector in Hamming space to a dense representation. (2) To generalize to queries and thresholds not covered by the training data, we also embed (Hamming) distance values. The distance embeddings are concatenated to the binary vector and its dense representation, and then fed to a neural network to produce final embeddings. The decoders take the final embeddings as input and output the estimated cardinality.

We design a loss function and a dynamic training strategy, both tailored to our incremental prediction model. The loss function adds more loss to the distance values that tend to cause more estimation error. The impact of such loss is dynamically adjusted through training to improve the accuracy and the generalizability. For fast online estimation, optimizations are developed on top of our regression model by reducing multiple encoders to one. As we are applying machine learning to a traditional database problem, an important issue is whether the solution works when updates exist. For this reason, we discuss incremental learning to handle updates in the dataset.

Extensive experiments were carried out on four common distance functions using real datasets. We took a uniform sample of records from each dataset as a query workload for training, validation, and testing, and computed labels by running exact similarity selection algorithms. The takeaways are: (1) The proposed deep learning method is more accurate than existing methods while also running faster with a moderate model size. (2) Incremental prediction guarantees monotonicity and at the same time achieves high accuracy, substantially outperforming the method that simply feeds a deep neural network with training data. (3) The components in our model are all useful for improving accuracy and speed. (4) Incremental learning is fast and effective against updates. (5) Our method delivers excellent performance on long-tail queries having exceptionally large cardinalities and generalizes well on out-of-dataset queries that significantly differ from the dataset. (6) A case study shows that query processing performance is improved by integrating our method into a query optimizer.
Our contributions are summarized as follows.
• We develop a deep learning method for cardinality estimation of similarity selection (Section 3). Our method guarantees the monotonicity of cardinality w.r.t. the threshold.
• Through feature extraction (Section 4) and regression (Section 5), our method is generic to any data type and distance function, and exploits the incremental property of cardinality to achieve accuracy and monotonicity. Training techniques that favor our method are developed (Section 6).
• We accelerate our model for online estimation (Section 7) and propose incremental learning for updates (Section 8).
• We conduct extensive experiments to demonstrate the superiority and the generalizability of our method, as well as how it works in a query optimizer (Section 9).

2 PRELIMINARIES
2.1 Problem Definition and Notations
Let O be a universe of records, and let 𝑥 and 𝑦 be two records in O. 𝑓 : O × O → R is a function which evaluates the distance (similarity) of a pair of records. Common distance (similarity) functions include Hamming distance, Jaccard similarity, edit distance, Euclidean distance, etc. Without loss of generality, we assume 𝑓 is a distance function. Given a collection of records D ⊆ O, a query record 𝑥 ∈ O, and a threshold 𝜃, a similarity selection is to find all the records 𝑦 ∈ D such that 𝑓(𝑥, 𝑦) ≤ 𝜃. We formally define our problem.

Problem 1 (Cardinality Estimation of Similarity Selection). Given a collection D of records, a query record 𝑥 ∈ O, a distance function 𝑓, and a threshold 𝜃 ∈ [0, 𝜃max], our task is to estimate the number of records that satisfy the similarity constraint, i.e., |{𝑦 | 𝑓(𝑥, 𝑦) ≤ 𝜃, 𝑦 ∈ D}|.

𝜃max is the maximum threshold (reasonably large for similarity selection to make sense) we are going to support. A good estimation is supposed to be close to the actual cardinality. Mean squared error (MSE) and mean absolute percentage error (MAPE) are two widely used evaluation metrics for the cardinality estimation problem [21, 32, 53, 57, 76]. Given 𝑛 similarity selection queries, let 𝑐_𝑖 denote the actual cardinality of the 𝑖-th selection and ĉ_𝑖 denote the estimated cardinality.

Table 1: Frequently used notations.
Symbol: Definition
O, D: a record universe, a dataset
𝑥, 𝑦: records in O
𝑓: a distance function
𝜃, 𝜃max: a distance threshold and its maximum value
𝑐, ĉ: cardinality and the estimated value
𝑔, ℎ: regression function and feature extraction function
x, 𝑑: the binary representation of 𝑥 and its dimensionality
𝜏, 𝜏max: a threshold in Hamming space and its maximum value
e_𝑖: the embedding of distance 𝑖
z_x^i: the embedding of x and distance 𝑖

MSE and MAPE are computed as

MSE = (1/𝑛) Σ_{𝑖=1}^{𝑛} (𝑐_𝑖 − ĉ_𝑖)²,  MAPE = (1/𝑛) Σ_{𝑖=1}^{𝑛} |(𝑐_𝑖 − ĉ_𝑖) / 𝑐_𝑖|.

Smaller errors are preferred. We adopt these two metrics to evaluate the estimation accuracy. We focus on evaluating the cardinality estimation as a stand-alone component (in contrast to inside an RDBMS) and only consider in-memory implementations. Table 1 lists the notations frequently used in this paper. We use bold uppercase letters (e.g., A) for matrices, bold lowercase letters (e.g., a) for vectors, and non-bold lowercase letters (e.g., 𝑎) for scalars and other variables. Uppercase Greek symbols (e.g., Φ) are used to denote neural networks. A[𝑖, ∗] and A[∗, 𝑖] denote the 𝑖-th row and the 𝑖-th column of A, respectively. a[𝑖] denotes the 𝑖-th dimension of a. A semicolon represents the concatenation of vectors; e.g., given an 𝑎-dimensional vector a and a 𝑏-dimensional vector b, c = [a; b] means that c[1..𝑎] = a and c[𝑎+1..𝑎+𝑏] = b. A colon represents the construction of a matrix from column vectors or matrices; e.g., C = [a : b] means that C[∗, 1] = a and C[∗, 2] = b.
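To make the two error metrics above concrete, here is a minimal sketch that computes MSE and MAPE for a batch of queries; it assumes the actual and estimated cardinalities are given as NumPy arrays and is not tied to any particular estimator.

```python
import numpy as np

def mse(c, c_hat):
    """Mean squared error between actual and estimated cardinalities."""
    c, c_hat = np.asarray(c, dtype=float), np.asarray(c_hat, dtype=float)
    return np.mean((c - c_hat) ** 2)

def mape(c, c_hat):
    """Mean absolute percentage error (actual cardinalities assumed non-zero)."""
    c, c_hat = np.asarray(c, dtype=float), np.asarray(c_hat, dtype=float)
    return np.mean(np.abs((c - c_hat) / c))

print(mse([10, 100], [12, 90]), mape([10, 100], [12, 90]))  # 52.0 and approximately 0.15
```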
2.2 Related Work
2.2.1 Database Methods. Auxiliary data structures are one of the main types of database methods for the cardinality estimation of similarity selection. For binary vectors, histograms [63] can be constructed to count the number of records by partitioning dimensions into buckets and enumerating binary vectors and thresholds. For strings and sets, semi-lattice structures [45, 46] and inverted indexes [36, 58] are utilized for estimation. The major drawback of auxiliary-structure methods is that they only perform well on low dimensionality and small thresholds. Another type of database method is based on sampling, e.g., uniform sampling, adaptive sampling [52], and sequential sampling [30]. State-of-the-art sampling strategies [35, 48, 75, 83] focus on join size estimation in query optimization and are difficult to adapt to our problem defined on distance functions. Sampling methods are often combined with sketches [27, 47] to improve the performance. A state-of-the-art method was proposed in [76] for high-dimensional data. In general, sampling methods need a large set of samples to achieve good accuracy, and thus become either slow or inaccurate when applied to large datasets. As for the cardinalities of SQL queries, recent studies proposed a tighter bound for intermediate join cardinalities [14] and adopted inaccurate cardinalities to generate optimal query plans [71].

2.2.2 Traditional Learning Models. A prevalent traditional learning method for the cardinality estimation of similarity selection is to train a kernel-based estimator [32, 57] using a sample. Monotonicity is guaranteed when the sample is deterministic w.r.t. the query record (i.e., the sample does not change if only the threshold changes). Kernel methods require a large number of instances for accurate estimation, hence resulting in low estimation speed. Moreover, they impose strong assumptions on kernel functions, e.g., only diagonal covariance matrices for Gaussian kernels. Other traditional learning models [33], such as support vector regression, logistic regression, and gradient boosting trees, are also adopted to solve the cardinality estimation problem. A query-driven approach [12] was proposed to learn several query prototypes (i.e., interpolation) that are differentiable. These approaches deliver performance comparable to kernel methods. A recent study [24] explored the application of two-hidden-layer neural networks and tree-based ensembles (random forests and gradient boosted trees) to cardinality estimation of multi-dimensional range queries. It targets numerical attributes but does not apply to similarity selections.

2.2.3 Deep Learning Models. Deep learning models have recently been adopted to learn the best join order [44, 54, 55] or estimate the join size [41]. Deep reinforcement learning has also been explored to generate query plans [60]. Recent studies also adopted a local-oriented approach [74], tree- or plan-based learning models [55, 56, 70], recurrent neural networks [61], and autoregressive models [31, 77].
The method that can be adapted to solve our problem is the mixture-of-experts model [67]. It utilizes a sparsely-gated mixture-of-experts layer and assigns good experts (models) to the inputs. Based on this model, the recursive-model index [43] was designed to replace the traditional B-tree index for range queries. The two models deliver good accuracy but do not guarantee monotonicity. For monotonic methods, an early attempt used a min-max 4-layer neural network for regression [19]. Lattice regression [25, 26, 29, 78] is a recent monotonic method. It adopts a lattice structure to construct all combinations of monotonic interpolation values. To handle high-dimensional monotonic features, ensembles of lattices [25] split lattices into several small pieces using ensemble learning. To improve the performance of regression, the deep lattice network (DLN) was proposed [78]. It consists of multiple calibration layers and an ensemble-of-lattices layer. However, lattice regression does not directly target our problem, and our experiments show that DLN is rather inaccurate.

3 CARDINALITY ESTIMATION FRAMEWORK
3.1 Basic Idea
Let 𝑐(𝑥, 𝜃) denote the cardinality of a query 𝑥 with threshold 𝜃. We model the estimation as a regression problem with a unique framework designed to alleviate the challenges mentioned in Section 1. We would like to find a function ĉ within a function family, such that ĉ returns an approximate value of 𝑐 for any input, i.e., ĉ(𝑥, 𝜃) ≈ 𝑐(𝑥, 𝜃), ∀𝑥 ∈ O and 𝜃 ∈ [0, 𝜃max]. We consider ĉ that belongs to the following family: ĉ := 𝑔 ◦ ℎ, where ℎ(𝑥, 𝜃) = (x, 𝜏), 𝑔(x, 𝜏) ∈ Z≥0, x ∈ {0, 1}^𝑑, and 𝜏 ∈ Z≥0. Intuitively, we can deem ℎ a feature extraction function, which maps an object 𝑥 and a threshold 𝜃 to a fixed-dimensional binary vector x and an integer threshold 𝜏. The function 𝑔 then essentially performs the regression using the transformed input, i.e., the (x, 𝜏) pair. The rationales of such a design are analyzed as follows:
• This design separates data modelling and cardinality estimation into two functions, ℎ and 𝑔, respectively. On one hand, this allows the system to cater for different data types and distance functions. On the other hand, it allows us to choose the best models for the estimation problem. To decouple the two components, some common interface needs to be established. We pose the constraints that (1) x belongs to a Hamming space, and (2) 𝜏 is a non-negative integer. For (1), many DB applications deal with discrete objects (e.g., sets and strings) or discrete object representations (e.g., binary codes from learned hash functions). For (2), since there is a finite number of thresholds that make a difference in cardinality, in theory we can always map them to integers, albeit a large number; here, we take the approximation of limiting it to a fixed number. Other learning models make similar modelling choices (e.g., most regularizations adopt the smoothness bias in the function space). We leave other interface design choices to future work due to their added complexity. For instance, it is entirely possible to allow x ∈ R^𝑑 and 𝜏 ∈ R. While the modelling power of the framework would increase, this would inevitably result in more complex models that are potentially difficult to train.
• The design can be deemed an instance of the encoder-decoder model, where two functions ℎ and 𝑔 are used for some prediction task. As a close analog, Google translation [37] trains an ℎ that maps inputs in the source language to a latent representation, and then trains a 𝑔 that maps the latent representation to the destination language. As such, it can support translation between 𝑛 languages by training only 2𝑛 functions, instead of 𝑛(𝑛 − 1) direct functions.
By this model design, the function ĉ = 𝑔 ◦ ℎ(𝑥, 𝜃) is monotonic if it satisfies the condition in the following lemma.

Lemma 1. Consider a function ℎ(𝑥, 𝜃) monotonically increasing with 𝜃 and a function 𝑔(x, 𝜏) monotonically increasing with 𝜏; our framework 𝑔 ◦ ℎ(𝑥, 𝜃) is monotonically increasing with 𝜃.

3.2 Feature Extraction
The process of feature extraction is to transform any data type and distance threshold into a binary representation and an integer threshold. Formally, we have a function ℎ_rec : O → {0, 1}^𝑑; i.e., given any record 𝑥 ∈ O, ℎ_rec maps 𝑥 to a 𝑑-dimensional binary vector, denoted by x and called 𝑥's binary representation. We can plug in any user-defined functions or neural networks for feature extraction. For the sake of estimation accuracy, the general criterion is that the Hamming distance of the target 𝑑-dimensional binary vectors should equivalently or approximately capture the semantics of the original distance function. We will show some example feature extraction methods and a series of case studies in Section 4. Besides the transformation to binary representations, we also have a monotonically increasing (as demanded by Lemma 1) function ℎ_thr : [0, 𝜃max] → [0, 𝜏max] to transform the threshold. 𝜏max is a tunable parameter to control the model size (as introduced later, there are (𝜏 + 1) decoders in our model, 𝜏 ≤ 𝜏max). Given a 𝜃 ∈ [0, 𝜃max], ℎ_thr maps 𝜃 to an integer between 0 and 𝜏max, denoted by 𝜏. The purpose of the threshold transformation is: for real-valued distance functions, it makes the distances countable; for integer-valued distance functions, it can reduce the threshold to a small number, hence preventing the model from growing too big when the input threshold is large. As such, we are able to use a finite number of estimators to predict the cardinality for each distance value. The design of the threshold transformation depends on how the original data are transformed to binary representations. In general, a transformation with less skew leads to better performance. Using the threshold of the Hamming distance between binary representations is not necessary, but is a preferable option. A few case studies will be given in Section 4.

3.3 Regression (in a Nutshell)
Our method for the regression is based on the following observation: given a binary representation x and a threshold 𝜏 (the threshold here is 𝜏, not 𝜏max, because 𝜃 is mapped to 𝜏), the cardinality can be divided into (𝜏 + 1) parts, each representing the cardinality of a Hamming distance 𝑖, 𝑖 ∈ [0, 𝜏]. This suggests that we can learn (𝜏 + 1) functions 𝑔_0(x), ..., 𝑔_𝜏(x), each 𝑔_𝑖(x) estimating the cardinality of the set of records whose Hamming distances to x are exactly 𝑖. So we have

𝑔(x, 𝜏) = Σ_{𝑖=0}^{𝜏} 𝑔_𝑖(x).  (1)

This design has the following advantages:
• As we have shown in Figure 1(a), the cardinalities for different distance values may differ significantly. By using individual estimators, the distribution of each distance value can be learned separately to achieve better overall accuracy.
• Our method exploits the incremental property of cardinality: when the threshold increases from 𝑖 to 𝑖 + 1, the increased cardinality is the cardinality for distance 𝑖 + 1. This incremental prediction guarantees the monotonicity of the cardinality estimation:

Lemma 2. 𝑔(x, 𝜏) is monotonically increasing with 𝜏 if each 𝑔_𝑖(x), 𝑖 ∈ [0, 𝜏], is deterministic and non-negative.

The lemma suggests that a deterministic and non-negative model satisfies the requirement in Lemma 1, hence leading to the overall monotonicity.
• We are able to control the size of the model by setting the maximum number of estimators. Thus, working with the feature extraction, the regression achieves fast speed even if the original threshold is large.

We employ a deep encoder-decoder model to process each regression 𝑔_𝑖. The reasons for choosing deep models are: (1) Cardinalities may significantly differ across queries, as shown in Figure 1(b). Deep models are able to learn a variety of underlying distributions and deliver salient performance on general regression tasks [43, 67, 78]. (2) Deep models generalize well on queries that are not covered by the training data. (3) Although deep models usually need large training sets for good accuracy, the training data here can be easily and efficiently acquired by running state-of-the-art similarity selection algorithms (producing no label noise when exact algorithms are used). (4) Deep models can be accelerated by modern hardware (e.g., GPUs/TPUs) or software (e.g., TensorFlow) that optimizes batch strategies or matrix manipulation. (Despite such possibilities for acceleration, we only use them for training but not for inference (estimation) in our experiments, for the sake of fair comparison.) By carefully choosing encoders and decoders, we can meet the requirement in Lemma 2 to guarantee the monotonicity. The details will be given in Section 5. Before that, we show some options and case studies of feature extraction.
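To illustrate the framework and the incremental property, here is a minimal sketch in which exact per-distance counters stand in for the learned estimators 𝑔_𝑖; it only shows how ĉ = 𝑔 ◦ ℎ decomposes and why the summed estimate is monotone in the threshold (Lemmas 1 and 2), not how the neural decoders are built.

```python
def estimate(x_raw, theta, h, g):
    """b_c = g o h: feature extraction, then incremental prediction (Equation 1)."""
    x, tau = h(x_raw, theta)                      # binary representation and integer threshold
    # each g[i] is deterministic and non-negative, so the sum is non-decreasing in tau
    return sum(g[i](x) for i in range(tau + 1))

# Toy instantiation: Hamming distance on 4-bit vectors over a tiny dataset.
data = [0b0000, 0b0001, 0b0011, 0b0111, 0b1111]
h = lambda x, theta: (x, theta)                   # identity feature extraction (as in Section 4.1)
g = [lambda x, i=i: sum(bin(x ^ y).count("1") == i for y in data)  # exact count at distance i
     for i in range(5)]
print([estimate(0b0000, t, h, g) for t in range(5)])   # [1, 2, 3, 4, 5], non-decreasing
```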
In our case studies, w e consider using a trans- formation proportional to the (expected/bounde d) Hamming distance between binary representations. Note that 𝜃 max is not necessarily mapped to 𝜏 max , b ecause for integer-value d dis- tance functions, the numb er of available thresholds is smaller than 𝜏 max + 1 when 𝜃 max < 𝜏 max , meaning that only ( 𝜃 max + 1 ) decoders are useful. In this case, 𝜃 max is mapped to a value smaller than 𝜏 max . Next we sho w four case studies of some common data typ es and distance functions. 4.1 Hamming Distance W e consider binary vector data and Hamming distance as the input distance function. The original data are directly fed to our regression mo del. Since the function is already Hamming distance, we use the original threshold 𝜃 as 𝜏 , if 𝜃 max ≤ 𝜏 max . Otherwise, we map 𝜃 max to 𝜏 max , and other thresholds are mapped proportionally: 𝜏 = ⌊ 𝜏 max · 𝜃 / 𝜃 max ⌋ . Although mul- tiple thresholds may map to the same 𝜏 , we can increase the number of decoders to mitigate the imprecision. 4.2 Edit Distance The (Levenshtein) edit distance measures the minimum num- ber of operations, including insertion, deletion, and substitu- tion of a character , to transform one string to another . The feature extraction is based on bounding. The basic idea is to map each character to ( 2 𝜏 max + 1 ) bits, hence to cover the eect of insertion and deletion. Let Σ denote the alphabet of strings, and 𝑙 max denote the maximum string length in D . Each binar y vector has 𝑑 = ( ( 𝑙 max + 2 𝜏 max ) · | Σ | ) bits. They are divided into | Σ | groups, each group representing a character in Σ . For ease of exposition, we assume the subscript of a string start from 0 , and the subscript of each group of the binary vector start from − 𝜏 max . All the bits are initialized as 0 . Given a string 𝑥 , for each character 𝜎 at p osition 𝑖 , we set the 𝑗 -th bit in the 𝜎 -th group to 1 , where 𝑗 iterates through 𝑖 − 𝜏 max to 𝑖 + 𝜏 max . For example, given a string 𝑥 = abc , Σ = { a , b , c , d } , 𝑙 max = 4 , and 𝜏 𝑚𝑎𝑥 = 1 , the binar y vector is 111000, 011100, 001110, 000000 (groups separated by comma). It can b e proved that an edit op eration causes at most ( 4 𝜏 max + 2 ) dierent bits. Hence 𝑓 ( 𝑥 , 𝑦 ) edit operations yield a Hamming distance no greater than 𝑓 ( 𝑥 , 𝑦 ) · ( 4 𝜏 max + 2 ) . Since it is proportional to 𝑓 ( 𝑥 , 𝑦 ) and thresholds are integers, we use the same threshold transformation as for Hamming distance. 4.3 Jaccard Distance Given two sets 𝑥 and 𝑦 , the Jaccard similarity is dened as | 𝑥 ∩ 𝑦 | / | 𝑥 ∪ 𝑦 | . For ease of exposition, we use its distance form: 𝑓 ( 𝑥 , 𝑦 ) = 1 − | 𝑥 ∩ 𝑦 | / | 𝑥 ∪ 𝑦 | . W e use 𝑏 -bit minwise hashing [ 50 ] (LSH) for feature ex- traction. Given a record 𝑥 , 𝜋 ( 𝑥 ) orders the elements of 𝑥 by a permutation on the record universe O . W e uniformly choose a set of 𝑘 permutations { 𝜋 1 , . . . , 𝜋 𝑘 } . Let 𝑏𝑚𝑖𝑛 ( 𝜋 ( 𝑥 ) ) denote the last 𝑏 bits of the smallest element of 𝜋 ( 𝑥 ) . W e re- gard 𝑏𝑚𝑖𝑛 ( 𝜋 ( 𝑥 ) ) as an integer in [ 0 , 2 𝑏 − 1 ] and transform it to a Hamming space. Let 𝑠 𝑒 𝑡 _ 𝑏 𝑖 𝑡 ( 𝑖 , 𝑗 ) produce a one-hot binary vector such that only the 𝑖 -th bit is 1 out of 𝑗 bits. 𝑥 is transformed to a 𝑑 -dimensional ( 𝑑 = 2 𝑏 𝑘 ) binary vector: [ 𝑠𝑒 𝑡 _ 𝑏𝑖 𝑡 ( 𝑏𝑚𝑖𝑛 ( 𝜋 1 ( 𝑥 ) ) , 2 𝑏 ) ; . . . ; 𝑠 𝑒 𝑡 _ 𝑏 𝑖 𝑡 ( 𝑏𝑚𝑖 𝑛 ( 𝜋 𝑘 ( 𝑥 ) ) , 2 𝑏 ) ] . For example: 𝑥 = { 1 , 2 , 4 } . O = { 1 , 2 , 3 , 4 , 5 } . 𝜋 1 = 12345 , 𝜋 2 = 54321 , and 𝜋 3 = 21453 . 𝑏 = 2 . 
4.3 Jaccard Distance
Given two sets 𝑥 and 𝑦, the Jaccard similarity is defined as |𝑥 ∩ 𝑦| / |𝑥 ∪ 𝑦|. For ease of exposition, we use its distance form: 𝑓(𝑥, 𝑦) = 1 − |𝑥 ∩ 𝑦| / |𝑥 ∪ 𝑦|. We use 𝑏-bit minwise hashing [50] (LSH) for feature extraction. Given a record 𝑥, 𝜋(𝑥) orders the elements of 𝑥 by a permutation on the record universe O. We uniformly choose a set of 𝑘 permutations {𝜋_1, ..., 𝜋_𝑘}. Let 𝑏𝑚𝑖𝑛(𝜋(𝑥)) denote the last 𝑏 bits of the smallest element of 𝜋(𝑥). We regard 𝑏𝑚𝑖𝑛(𝜋(𝑥)) as an integer in [0, 2^𝑏 − 1] and transform it to a Hamming space. Let 𝑠𝑒𝑡_𝑏𝑖𝑡(𝑖, 𝑗) produce a one-hot binary vector in which only the 𝑖-th bit out of 𝑗 bits is 1. 𝑥 is transformed to a 𝑑-dimensional (𝑑 = 2^𝑏 · 𝑘) binary vector: [𝑠𝑒𝑡_𝑏𝑖𝑡(𝑏𝑚𝑖𝑛(𝜋_1(𝑥)), 2^𝑏); ...; 𝑠𝑒𝑡_𝑏𝑖𝑡(𝑏𝑚𝑖𝑛(𝜋_𝑘(𝑥)), 2^𝑏)]. For example, let 𝑥 = {1, 2, 4}, O = {1, 2, 3, 4, 5}, 𝜋_1 = 12345, 𝜋_2 = 54321, 𝜋_3 = 21453, and 𝑏 = 2. We have 𝑏𝑚𝑖𝑛(𝜋_1(𝑥)) = 1, 𝑏𝑚𝑖𝑛(𝜋_2(𝑥)) = 0, and 𝑏𝑚𝑖𝑛(𝜋_3(𝑥)) = 2. Suppose 𝑠𝑒𝑡_𝑏𝑖𝑡 counts from the lowest bit, starting from 0. The binary vector is 0010, 0001, 0100 (permutations separated by commas). Given two sets 𝑥 and 𝑦, the probability that 𝑏𝑚𝑖𝑛(𝜋(𝑥)) = 𝑏𝑚𝑖𝑛(𝜋(𝑦)) equals 1 − 𝑓(𝑥, 𝑦) [50]. The expected Hamming distance between x and y is thus 𝑓(𝑥, 𝑦) · 𝑑. Since it is proportional to 𝑓(𝑥, 𝑦), we use the following threshold transformation: 𝜏 = ⌊𝜏max · 𝜃 / 𝜃max⌋.
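The following sketch illustrates this 𝑏-bit minwise hashing transformation on the example above; representing each permutation as a rank dictionary (element to position in the permutation) is an implementation choice for the sketch, not part of the paper.

```python
import numpy as np

def bbit_minhash_features(x, perms, b):
    """b-bit minwise hashing feature extraction for Jaccard distance (Section 4.3)."""
    block = 1 << b                               # 2^b bits per permutation
    bits = []
    for rank in perms:
        m = min(x, key=lambda e: rank[e])        # smallest element of x under this permutation
        v = m & (block - 1)                      # bmin: the last b bits of that element
        one_hot = [0] * block
        one_hot[v] = 1                           # set_bit(v, 2^b), counting from the lowest bit
        bits.extend(one_hot)
    return np.array(bits, dtype=np.uint8)

# Example above: x = {1, 2, 4}, O = {1, ..., 5}, b = 2, pi1 = 12345, pi2 = 54321, pi3 = 21453.
perms = [{1: 0, 2: 1, 3: 2, 4: 3, 5: 4},
         {5: 0, 4: 1, 3: 2, 2: 3, 1: 4},
         {2: 0, 1: 1, 4: 2, 5: 3, 3: 4}]
# Bit positions 0..3 of each block are stored left to right, so bmin values 1, 0, 2
# appear as [0,1,0,0], [1,0,0,0], [0,0,1,0].
print(bbit_minhash_features({1, 2, 4}, perms, b=2))
```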
4.4 Euclidean Distance
We use LSH based on 𝑝-stable distributions [22] to handle Euclidean distance on real-valued vectors. The hash function is ℎ_{a,𝑏}(𝑥) = ⌊(a · 𝑥 + 𝑏) / 𝑟⌋, where a is a |𝑥|-dimensional vector with each element independently drawn from the normal distribution N(0, 1), 𝑏 is a real number chosen uniformly from [0, 𝑟], and 𝑟 is a predefined constant. Let 𝑣 denote the maximum hash value. We use the aforementioned 𝑠𝑒𝑡_𝑏𝑖𝑡 function to transform hash values to a Hamming space. Given 𝑘 hash functions, 𝑥 is transformed to a 𝑑-dimensional (𝑑 = 𝑘(𝑣 + 1)) binary vector: [𝑠𝑒𝑡_𝑏𝑖𝑡(ℎ_{a_1,𝑏_1}(𝑥), 𝑣 + 1); ...; 𝑠𝑒𝑡_𝑏𝑖𝑡(ℎ_{a_𝑘,𝑏_𝑘}(𝑥), 𝑣 + 1)]. For example, let 𝑥 = [0.1, 0.2, 0.4], 𝑣 = 4, ℎ_{a_1,𝑏_1}(𝑥) = 1, ℎ_{a_2,𝑏_2}(𝑥) = 3, and ℎ_{a_3,𝑏_3}(𝑥) = 4. Suppose 𝑠𝑒𝑡_𝑏𝑖𝑡 counts from the lowest bit, starting from 0. The binary vector is 00010, 01000, 10000 (hash functions separated by commas). Given two records 𝑥 and 𝑦 such that 𝑓(𝑥, 𝑦) = 𝜃, the probability that two hash values match is Pr{ℎ_{a,𝑏}(𝑥) = ℎ_{a,𝑏}(𝑦)} = 𝜖(𝜃) = 1 − 2 · 𝑛𝑜𝑟𝑚(−𝑟/𝜃) − (2 / (√(2𝜋) · 𝑟/𝜃)) · (1 − e^{−𝑟²/(2𝜃²)}), where 𝑛𝑜𝑟𝑚(·) is the cumulative distribution function of a random variable with normal distribution N(0, 1) [22]. Hence the expected Hamming distance between their binary representations is (1 − 𝜖(𝜃)) · 𝑑. The threshold transformation is 𝜏 = ⌊𝜏max · (1 − 𝜖(𝜃)) / (1 − 𝜖(𝜃max))⌋.

5 REGRESSION
We present the detailed regression model in this section. Figure 2 shows the framework of our model. (1) x and 𝜏 are input to the encoder Ψ, which returns 𝜏 + 1 embeddings z_x^i. Specifically, x is embedded into a dense vector space. Each distance 𝑖 is also embedded, concatenated to the embedding of x, and fed to a neural network Φ to produce z_x^i. (2) Each of the 𝜏 + 1 decoders 𝑔_𝑖 takes an embedding z_x^i as input and returns the cardinality for distance 𝑖, 𝑖 ∈ [0, 𝜏]. (3) The 𝜏 + 1 cardinalities are summed up to get the final cardinality.

5.1 Encoder-Decoder Model
Our solution to the regression is to embed the binary vector x and distance 𝑖 into a dense real-valued vector z_x^i by an encoder Ψ, and then model 𝑔_𝑖(x) as a decoder that performs an affine transformation and applies a ReLU activation function: 𝑔_𝑖(x) = ReLU(w_𝑖ᵀ Ψ(x, 𝑖) + 𝑏_𝑖) = ReLU(w_𝑖ᵀ z_x^i + 𝑏_𝑖). w_𝑖 and 𝑏_𝑖 are the parameters of the mapping from the embedding z_x^i to the cardinality estimate for distance 𝑖. From the machine learning perspective, if a representation of the input features is well learned through an encoder, then a linear model (affine transformation) is capable of making the final decisions [13]. ReLU is chosen here because cardinality is non-negative and matches the range of ReLU. The reason why we also embed distance 𝑖 is as follows. Consider the case where only x is embedded. If the cardinalities of two records 𝑥_1 and 𝑥_2 are close for distance values in a range [𝜏_1, 𝜏_2] covered by the training examples, their embeddings are likely to become similar after training, because the encoder may mistakenly regard 𝑥_1 and 𝑥_2 as similar. This may cause 𝑔_𝑖(x_1) ≈ 𝑔_𝑖(x_2) for 𝑖 ∉ [𝜏_1, 𝜏_2], i.e., the distance values not covered by the training examples, even if their actual cardinalities significantly differ. By Equation 1, the outputs of the (𝜏 + 1) decoders are summed up to obtain the cardinality. 𝑔_𝑖(x) is deterministic if we use a deterministic Ψ(·, ·); hence the model can satisfy the requirement in Lemma 2 to guarantee the monotonicity.

5.2 Encoder in Detail
To encode both x and distance 𝑖 into the embedding z_x^i, Ψ includes a representation network Γ that maps x to a dense vector space, a distance embedding layer E, and a shared neural network Φ that outputs the embedding z_x^i. Next we introduce the details.

5.2.1 Representation Network. Given a binary representation x generated by the feature extraction function ℎ(·, ·), we design a neural network Γ that maps x to another vector space, x′ = Γ(x), because the correlations of sparse high-dimensional binary vectors are difficult to learn. The variational auto-encoder (VAE) [40] is a generative model that estimates data distributions by unsupervised learning. Auto-encoders (AEs) can be viewed as non-linear PCAs that reduce dimensionality and extract meaningful and robust features, and the VAE additionally enforces continuity in the latent space. The VAE improves upon other types of AEs (such as denoising AEs and sparse AEs) by imposing a regularization condition on the embeddings. This results in embeddings that are robust and disentangled, and hence it has been widely used in various models [72]. We use the latent layer of the VAE to produce a dense representation, denoted by 𝑉𝐴𝐸(x, 𝜖), where 𝜖 is random noise generated from the normal distribution N(0, I). Γ concatenates x and the output of the VAE; i.e., x′ = [x; 𝑉𝐴𝐸(x, 𝜖)]. The reason for such concatenation (i.e., not using only the output of the VAE as x′) is that the (cosine) distance in the 𝑉𝐴𝐸(x, 𝜖) space captures less of the semantics of the original distance than does the Hamming distance between binary vectors. Due to the noise 𝜖, the output of the VAE is nondeterministic. Since we need a deterministic output to guarantee the monotonicity, we choose the following option: for training, we still use the nondeterministic x′ = [x; 𝑉𝐴𝐸(x, 𝜖)], because this makes our model generalize to unseen records and thresholds; for inference (online estimation), we set x′ = [x; E_{𝜖∼N(0,I)}[𝑉𝐴𝐸(x, 𝜖)]], where E[·] denotes the expected value, so it becomes deterministic [68].

Example 1. Figure 3 shows an example of x and its embedding x′. Suppose x = 0010. The VAE takes x as input and outputs a dense vector, say [0.7, 1.2]. The two are concatenated to obtain x′ = [0010, 0.7, 1.2].

5.2.2 Distance Embeddings. In order to embed x and distance 𝑖 into the same vector, we design a distance embedding layer (a matrix) E to embed each distance 𝑖. Each column in E represents a distance embedding; i.e., e_𝑖 = E[∗, 𝑖]. E is initialized randomly, following the standard normal distribution.

5.2.3 Final Embeddings. The distance embedding e_𝑖 is concatenated with x′; i.e., x_𝑖 = [x′; e_𝑖]. Then we use a feedforward neural network (FNN) Φ to generate the embeddings z_x^i = Φ(x_𝑖).
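To make the wiring of Γ, E, Φ, and the decoders concrete, here is a single-query forward pass with toy, untrained parameters; a tanh map stands in for the VAE latent and Φ is reduced to one layer, so the sketch only mirrors the shapes and the ReLU decoders, not the actual trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, latent, emb_dim, z_dim, tau_max = 64, 40, 5, 60, 20   # illustrative sizes

W_vae = 0.1 * rng.normal(size=(d, latent))               # stand-in for the VAE latent mapping
E = rng.normal(size=(emb_dim, tau_max + 1))              # distance-embedding matrix, column i = e_i
W_phi = 0.1 * rng.normal(size=(d + latent + emb_dim, z_dim))  # one-layer stand-in for Phi
w = rng.normal(size=(tau_max + 1, z_dim))                # decoder weights w_i
b = rng.normal(size=tau_max + 1)                         # decoder biases b_i

def estimate(x, tau):
    """Encoder-decoder forward pass for one query (toy weights)."""
    x_prime = np.concatenate([x, np.tanh(x @ W_vae)])    # x' = [x; VAE(x)]
    total = 0.0
    for i in range(tau + 1):
        x_i = np.concatenate([x_prime, E[:, i]])         # x_i = [x'; e_i]
        z_i = np.maximum(x_i @ W_phi, 0.0)               # z_i = Phi(x_i)
        total += max(float(w[i] @ z_i + b[i]), 0.0)      # g_i(x) = ReLU(w_i^T z_i + b_i)
    return total

x = rng.integers(0, 2, size=d).astype(float)
print(estimate(x, 8) <= estimate(x, 12))                 # True: monotone in tau by construction
```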
Example 2. We follow Example 1. Suppose 𝜏 = 2. Then we have three distance embeddings, e_0 = [1.1, 0.7], e_1 = [1.5, 0.3], and e_2 = [1.8, 0.9]. By concatenating x′ and each e_𝑖, we obtain x_0 = [0010, 0.7, 1.2, 1.1, 0.7], x_1 = [0010, 0.7, 1.2, 1.5, 0.3], and x_2 = [0010, 0.7, 1.2, 1.8, 0.9]. They are sent to the neural network Φ, whose outputs are z_x^0 = [0.2, 1.1, 0.8], z_x^1 = [0.9, 0.4, 1.1], and z_x^2 = [0.8, 1.7, 1.4].

[Figure 2: The regression model.]
[Figure 3: Example of encoder Ψ.]

6 MODEL TRAINING
6.1 Data Preparation
Consider a query workload Q of records (see Section 9.1.1 for the choice of Q in our experiments). We split the data in Q into training, validation, and testing sets. Then we uniformly generate a set of thresholds in [0, 𝜃max], denoted by 𝑆. For each record 𝑥 in the training set, we iterate through all the thresholds 𝜃 in 𝑆 and compute the cardinality 𝑐 w.r.t. D using an exact similarity selection algorithm. Then ⟨𝑥, 𝜃, 𝑐⟩ is used as a training example. We uniformly choose thresholds in 𝑆 for validation and in [0, 𝜃max] for testing.
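A sketch of this label-generation loop; the brute-force count stands in for an exact similarity selection algorithm, and the function name is a placeholder rather than the paper's implementation.

```python
def build_training_examples(train_queries, thresholds, dataset, dist):
    """Label each (query, threshold) pair with its exact cardinality w.r.t. the dataset."""
    examples = []
    for x in train_queries:
        for theta in thresholds:
            c = sum(dist(x, y) <= theta for y in dataset)   # exact count, so no label noise
            examples.append((x, theta, c))
    return examples

# Usage with any distance function f defined on the records, e.g.:
# examples = build_training_examples(train_set, S, D, f)
```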
6.2 Loss Function and Dynamic Training
The loss function is defined as follows:

L(ĉ, c) = E_{𝜏∼𝑃(·)}[L_𝑔(ĉ, c)] + 𝜆 · L_vae(x),  (2)

where L_𝑔(·, ·) is the loss of the regression model and L_vae(·) is the loss of the VAE. ĉ and c are two vectors, each dimension representing the estimated and the actual cardinalities of a set of training examples, respectively. 𝜆 is a positive hyperparameter for the importance of the VAE. A caveat is that although we uniformly sample thresholds in [0, 𝜃max] for the training data, this does not necessarily mean that the threshold 𝜏 after feature extraction is uniformly distributed in [0, 𝜏max], e.g., for Euclidean distance (Section 4.4). To take this factor into account, we approximate the probability of 𝜏 by the empirical probability of thresholds after running feature extraction on the validation set; i.e., 𝑃(𝜏) ≈ (Σ_{⟨𝑥,𝑖,𝑐⟩ ∈ T_valid} 1[ℎ_thr(𝑖) = 𝜏]) / |T_valid|, where T_valid is the validation set and 1[·] is the indicator function. For L_𝑔, instead of using MSE or MAPE, we resort to the mean squared logarithmic error (MSLE) for the following reason: MSLE is an approximation of MAPE [62] and narrows the large output space down to a smaller one, thereby decreasing the learning difficulty.

We then propose a training strategy for better accuracy. Given a set of training examples, let c_0, ..., c_𝜏 and ĉ_0, ..., ĉ_𝜏 denote the cardinalities and the estimated values for distances 0, ..., 𝜏 in these training examples, respectively. As we have shown in Figure 1(a), the cardinalities may vary significantly across distance values. Some of them may result in much worse estimations than others and compromise the overall performance. The training procedure should gradually focus on training these bad estimations. Thus, we consider the loss caused by the estimation for each distance 𝑖, combined with MSLE:

L_𝑔(ĉ, c) = MSLE(ĉ, c) + 𝜆_Δ · Σ_{𝑖=0}^{𝜏max} 𝜔_𝑖 · MSLE(ĉ_𝑖, c_𝑖).  (3)

Each 𝜔_𝑖 is a hyperparameter automatically adjusted during the training procedure; hence we call this dynamic training. It controls the loss of the estimation for distance 𝑖, and Σ_{𝑖=0}^{𝜏max} 𝜔_𝑖 = 1. 𝜆_Δ is a hyperparameter that controls the impact of the losses of all the estimations for 𝑖 ∈ [0, 𝜏max]. Due to the non-convexity of L_𝑔, it is difficult to find the correct direction of the gradient that reaches the global or a good local optimum. Nonetheless, we can adjust its direction by considering the loss trend of the estimation for distance 𝑖, hence encouraging the model to generalize rather than to overfit the training data. Let ℓ_𝑖(𝑡) = MSLE(ĉ_𝑖, c_𝑖) denote the loss of the estimation for distance 𝑖 in the 𝑡-th iteration of validation. The loss trend Δℓ_𝑖(𝑡) = ℓ_𝑖(𝑡) − ℓ_𝑖(𝑡 − 1) is calculated, and after each validation we adjust 𝜔_𝑖 by adding more gradient where the loss grows: (1) if Δℓ_𝑖(𝑡) > 0, then 𝜔_𝑖 = Δℓ_𝑖(𝑡) / Σ_{𝑗∈𝐴} Δℓ_𝑗(𝑡), where 𝐴 = {𝑗 | Δℓ_𝑗(𝑡) > 0, 0 ≤ 𝑗 ≤ 𝜏max}; (2) otherwise, 𝜔_𝑖 = 0.

[Figure 4: Φ′ in the accelerated regression model.]

7 ACCELERATING ESTIMATION
Recall that in the regression model, we pair x′ with (𝜏 + 1) distance embeddings in the encoder Ψ to produce the embeddings z_x^i. This leads to a high computation cost for online cardinality estimation when 𝜏 is large. To reduce the cost, we propose an accelerated model, using a neural network Φ′ to replace Φ and the distance embedding layer E and to output all (𝜏max + 1) embeddings z_x^i together. (We output (𝜏max + 1) embeddings since this number is constant for all queries, which favors implementation; only the first (𝜏 + 1) embeddings are used for a given 𝜏.) Φ′ only takes the input x′ and reduces the computation cost from 𝑂((𝜏 + 1) · |Φ|) to 𝑂(|Φ′|), where |Φ| and |Φ′| denote the numbers of parameters in the two neural networks. Figure 4 shows the framework of Φ′, an FNN comprised of 𝑛 hidden layers f_1, f_2, ..., f_𝑛, each outputting some dimensions of the embeddings z_x^i. Each z_x^i is partitioned into 𝑛 regions, denoted by z_x^i[𝑟_0, 𝑟_1], z_x^i[𝑟_1, 𝑟_2], ..., z_x^i[𝑟_{𝑛−1}, 𝑟_𝑛], where 0 = 𝑟_0 ≤ 𝑟_1 ≤ ... ≤ 𝑟_𝑛 and 𝑟_𝑛 equals the dimensionality of z_x^i. A hidden layer f_𝑗 outputs the 𝑗-th region of every z_x^i, 𝑖 ∈ [0, 𝜏max]; i.e., Z_𝑗 = [z_x^0[𝑟_{𝑗−1}, 𝑟_𝑗] : z_x^1[𝑟_{𝑗−1}, 𝑟_𝑗] : ... : z_x^{𝜏max}[𝑟_{𝑗−1}, 𝑟_𝑗]]. Then we concatenate all the regions: Z = [Z_1 : Z_2 : ... : Z_𝑛]. Each row of Z is an embedding: z_x^i = Z[𝑖, ∗]. In addition to fast estimation, the model Φ′ has the following advantages: (1) The parameters of each hidden layer f_𝑗 are updated from the following layer f_{𝑗+1} and from the final embedding matrix Z through backpropagation, hence increasing the ability to learn good embeddings. (2) Since each embedding is affected by all the hidden layers, the model is more likely to reach a better local optimum through training. (3) In contrast to having a single output layer, all the hidden layers output embeddings, so gradient vanishing and overfitting can be prevented.
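The following sketch shows the accelerated encoder's output layout with toy parameters: each hidden layer emits one region of all (𝜏max + 1) embeddings, and the regions are concatenated into Z, whose rows are the z_x^i. The layer widths and the equal-sized regions are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
x_dim, tau_max, z_dim, n_layers, width = 104, 20, 60, 3, 256
slice_dim = z_dim // n_layers                      # equal-sized regions r_{j-1}..r_j (assumed)

# hidden layers f_1..f_n of Phi'; each also emits one region of all tau_max+1 embeddings
hidden = [0.05 * rng.normal(size=(x_dim if j == 0 else width, width)) for j in range(n_layers)]
emit   = [0.05 * rng.normal(size=(width, (tau_max + 1) * slice_dim)) for _ in range(n_layers)]

def phi_prime(x_prime):
    """One forward pass of Phi': all (tau_max + 1) embeddings at once."""
    h, regions = x_prime, []
    for W_h, W_e in zip(hidden, emit):
        h = np.maximum(h @ W_h, 0.0)                               # hidden layer f_j
        regions.append((h @ W_e).reshape(tau_max + 1, slice_dim))  # Z_j: region j of every z_x^i
    return np.concatenate(regions, axis=1)                         # Z; row i is z_x^i

Z = phi_prime(rng.normal(size=x_dim))
print(Z.shape)   # (21, 60): tau_max + 1 embeddings of dimension z_dim
```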
We analyze the complexities of our models. Assume an FNN has hidden layers a_1, ..., a_𝑛. Given an input x and an output y, the complexity of the FNN is |FNN(x, y)| = |x| · |a_1| + Σ_{𝑖=1}^{𝑛−1} |a_𝑖| · |a_{𝑖+1}| + |a_𝑛| · |y|. Our model has the following components: Φ, Γ, E, and 𝑔_𝑖. The complexities of Φ and Γ are |FNN([x′; e_𝑖], z_x^i)| and |FNN(x, x)|, respectively. |E| = (𝜏max + 1) · |e_𝑖|, and |𝑔_𝑖| = (𝜏max + 1) · |z_x^i| + 𝜏max + 1. Thus, the complexity of our model without acceleration is |FNN([x′; e_𝑖], z_x^i)| + |FNN(x, x)| + (𝜏max + 1) · |e_𝑖| + (𝜏max + 1) · |z_x^i| + 𝜏max + 1. With the accelerated FNN (AFNN), the complexity is |AFNN(x′, Z)| + |FNN(x, x)| + (𝜏max + 1) · |z_x^i| + 𝜏max + 1, where |AFNN(x′, Z)| = |x′| · |a_1| + Σ_{𝑖=1}^{𝑛−1} |a_𝑖| · |a_{𝑖+1}| + (𝜏max + 1) · |a_𝑛| · |Z[𝑖, ∗]|.

8 DEALING WITH UPDATES
When the dataset is updated, the labels of the validation data are first updated by running a similarity selection algorithm on the updated dataset. Then we monitor the error (MSLE) on the validation data by running our model. If the error increases, we train our model with incremental learning: first, the labels of the training data are updated by running a similarity selection algorithm on the updated dataset; then the model is trained with the updated training data until the validation error does not change for three consecutive epochs. Note that (1) the training does not start from scratch but from the current model, and it is processed on the entire training data to prevent catastrophic forgetting [39, 59]; and (2) we always keep the original queries and only update their labels.

9 EXPERIMENTS
9.1 Experiment Setup
9.1.1 Datasets and Queries. We use eight datasets for four distance functions: Hamming distance (HM), edit distance (ED), Jaccard distance (JC), and Euclidean distance (EU). The datasets and their statistics are shown in Table 2. The default dataset for each distance function is HM-ImageNet, ED-AMiner, JC-BMS, and EU-Glove300, respectively. Process indicates how we process the dataset; e.g., HashNet [15] is adopted to convert images in the ImageNet dataset [66] to hash codes. ℓmax and ℓavg are the maximum and average lengths or dimensionalities of the records, respectively. We uniformly sample 10% of the data from dataset D as the query workload Q. Then we follow the method in Section 6.1 to split Q in the ratio 80 : 10 : 10 to create training, validation, and testing instances. Multiple uniform samplings are not considered here because even if our models are trained on skewed data sampled with equal chance from each cluster of D, we observe only a moderate change in the accuracy of our models (up to 48% in MSE) when testing over multiple uniform samples, and they still perform significantly better than the other competitors. Thresholds and labels (cardinalities w.r.t. D) are generated using the method in Section 6.1.
Table 2: Statistics of datasets.
Dataset | Source | Process | Data Type | Domain | # Records | ℓmax | ℓavg | Distance | 𝜃max
HM-ImageNet | [1] | HashNet [15] | binary vector | image | 1,431,167 | 64 | 64 | Hamming | 20
HM-PubChem | [2] | - | binary vector | biological sequence | 1,000,000 | 881 | 881 | Hamming | 30
ED-AMiner | [3] | - | string | author name | 1,712,433 | 109 | 13.02 | edit | 10
ED-DBLP | [4] | - | string | publication title | 1,000,000 | 199 | 72.49 | edit | 20
JC-BMS | [5] | - | set | product entry | 515,597 | 164 | 6.54 | Jaccard | 0.4
JC-DBLP_q3 | [4] | 3-gram | set | publication title | 1,000,000 | 197 | 70.49 | Jaccard | 0.4
EU-Glove300 | [6] | normalize | real-valued vector | word embedding | 1,917,494 | 300 | 300 | Euclidean | 0.8
EU-Glove50 | [6] | normalize | real-valued vector | word embedding | 400,000 | 50 | 50 | Euclidean | 0.8

Table 3: MSE, best values highlighted in boldface.
Model | HM-ImageNet | HM-PubChem | ED-AMiner | ED-DBLP | JC-BMS | JC-DBLP_q3 | EU-Glove300 | EU-Glove50
DB-SE | 41563 | 445182 | 8219583 | 1681 | 1722 | 177 | 116820 | 45631
DB-US | 27776 | 66255 | 159572 | 1095 | 3052 | 427 | 78552 | 16249
TL-XGB | 12082 | 882206 | 4147509 | 1657 | 2031 | 23 | 821937 | 557229
TL-LGBM | 14132 | 721609 | 4830965 | 2103 | 1394 | 49 | 844301 | 512984
TL-KDE | 279782 | 112952 | 3412627 | 2097 | 873 | 100 | 102200 | 169604
DL-DLN | 7307 | 189743 | 1285010 | 1664 | 1349 | 50 | 1063687 | 49389
DL-MoE | 7096 | 95447 | 265257 | 1235 | 425 | 23 | 988918 | 315437
DL-RMI | 6774 | 42186 | 93158 | 928 | 151 | 15 | 45165 | 6791
DL-DNN | 10075 | 231167 | 207286 | 1341 | 1103 | 138 | 1192426 | 27892
DL-DNNs_τ | 4236 | 51026 | 217193 | 984 | 1306 | 207 | 1178239 | 87991
DL-BiLSTM | - | - | 104152 | 1034 | - | - | - | -
DL-BiLSTM-A | - | - | 115111 | 1061 | - | - | - | -
CardNet | 2871 | 12809 | 52101 | 446 | 75 | 2 | 6822 | 3245
CardNet-A | 3044 | 11598 | 64831 | 427 | 64 | 3 | 16809 | 3269

Table 4: MAPE (in percentage), best values highlighted in boldface.
Model | HM-ImageNet | HM-PubChem | ED-AMiner | ED-DBLP | JC-BMS | JC-DBLP_q3 | EU-Glove300 | EU-Glove50
DB-SE | 56.14 | 79.74 | 80.15 | 57.23 | 45.12 | 10.42 | 41.91 | 47.12
DB-US | 62.51 | 141.04 | 61.98 | 56.80 | 60.06 | 50.52 | 112.06 | 98.23
TL-XGB | 13.87 | 152.20 | 113.68 | 33.26 | 26.52 | 6.70 | 14.46 | 33.87
TL-LGBM | 14.12 | 110.22 | 115.88 | 30.29 | 21.39 | 10.71 | 17.49 | 37.54
TL-KDE | 85.57 | 179.39 | 105.17 | 60.23 | 27.59 | 37.79 | 52.38 | 59.84
DL-DLN | 20.72 | 174.69 | 73.48 | 39.10 | 42.50 | 6.54 | 21.67 | 33.95
DL-MoE | 11.93 | 49.47 | 57.79 | 31.81 | 16.37 | 4.10 | 11.94 | 26.82
DL-RMI | 12.36 | 50.57 | 52.81 | 32.24 | 15.02 | 4.78 | 5.48 | 15.03
DL-DNN | 14.24 | 198.36 | 51.36 | 34.12 | 28.41 | 5.11 | 7.24 | 17.57
DL-DNNs_τ | 13.00 | 46.43 | 53.82 | 30.91 | 19.67 | 5.81 | 9.19 | 21.95
DL-BiLSTM | - | - | 43.44 | 40.95 | - | - | - | -
DL-BiLSTM-A | - | - | 45.25 | 41.12 | - | - | - | -
CardNet | 8.41 | 35.66 | 42.26 | 22.53 | 11.25 | 3.18 | 4.04 | 11.85
CardNet-A | 9.63 | 36.57 | 44.78 | 23.07 | 13.94 | 3.05 | 4.58 | 12.71

9.1.2 Models. Our models are referred to as CardNet and CardNet-A. The latter is equipped with the acceleration (Section 7). We compare with the following categories of methods. (1) Database methods: DB-SE, a specialized estimator for each distance function (histogram [63] for HM, inverted index [36] for ED, semi-lattice [46] for JC, and LSH-based sampling [76] for EU), and DB-US, which uniformly samples 1% of the records from D and estimates the cardinality using the sample. We do not consider higher sample ratios because 1% samples are already very slow (Table 5). (2) Traditional learning methods: TL-XGB (XGBoost) [16], TL-LGBM (LightGBM) [38], and TL-KDE [57]. (3) Deep learning methods: DL-DLN [78]; DL-MoE [67]; DL-RMI [43]; DL-DNN, a vanilla FNN with four hidden layers; and DL-DNNs_τ, a set of (𝜏max + 1) independently learned deep neural networks, each against a threshold range (computed using the threshold transformation in Section 4).
For ED , we also have a metho d that replaces Γ in CardNet or CardNet- A with a character-level bidirectional LSTM [ 17 ], referred to as DL-BiLSTM or DL-BiLSTM-A . Since the above learning models need vectors ( except for TL-KDE which is fed with original input) as input, we use the same feature extraction as our models on ED and JC . On HM and EU , they are fed with original vectors. A s for other de ep learning mo d- els [ 31 , 41 , 55 , 56 , 60 , 61 , 70 , 74 , 77 ], when adapted for our problem, [ 41 ] becomes a feature extraction by deep set [ 79 ] plus a regression by DL-DNN , while the others become ex- actly DL-DNN . W e will show that deep set is outperformed by our feature extraction (T able 6, also obser ved when applying to other metho ds). Hence these models are not repeatedly compared. Among the compared models, DB-SE , TL-X GB , TL-LGBM , TL-KDE , DL-DLN , and our models ar e monotonic. Among the models involving FNNs, DL-MoE and DL-RMI are more complex than ours in most cases, depending on the number of FNNs and other hyperparameter tuning. DL-DNN is less complex than ours. DL-DNNs 𝜏 is more complex. 9.1.3 Hyperparameter Tuning. W e use 256 hash functions for Jaccard distance, 256 (on EU-Glove 50 ) and 512 (on EU-Glove 300 ) hash functions for Euclidean distance. The V AE is a fully- connected neural network, with thr ee hidden layers of 256, 128, and 128 nodes for both enco der and de coder . The activa- tion function is ELU , in line with [ 40 ]. The dimensionality of the V AE ’s output is 40, 128, 128, 128, 64, 64, 64, 32 as per the order in T able 2. W e use a fully-connected neural netw ork with four hidden layers of 512, 512, 256, and 256 nodes for b oth Φ and Φ ′ . The activation function is ReLU . The dimensionality of distance embeddings is 5. The dimensionality of z 𝑖 𝑥 is 60. W e set 𝜆 in Equation 2 and 𝜆 Δ in Equation 3 to both 0.1. The V AE is trained for 100 epo chs. Our models are traine d for 800 epochs. 9.1.4 Environments. The experiments were carried out on a server with a Intel Xeon E5-2640 @2.40GHz CP U and 256GB RAM running Ubuntu 16.04.4 LTS. Non-deep models were implemented in C++. Deep mo dels were trained in T ensoro w , and then the parameters were copied to C++ implementations for a fair comparison of estimation eciency , in line with [ 43 ]. 9.2 Estimation Accuracy W e report the accuracies of various models in T ables 3 and 4, measured by MSE and MAPE . CardNet and CardNet- A report similar MSE and MAPE . They achieve the b est performance on almost all the datasets (except CardNet- A ’s MAPE on ED- AMiner ), showcasing that the components in our mo del design collectively lead to better accuracy . 
For the four distance functions, the MSE (of the better of our two models) is at least 1.5, 1.8, 4.1, and 2.1 times smaller than the best of the other methods, respectively. The MAPE is reduced by at least 23.2%, 2.7%, 25.6%, and 21.2% from the best of the other methods, respectively. In general, deep learning methods are more accurate than database and traditional learning methods. Among the deep learning methods, DL-DLN performs the worst, DL-MoE is in the middle, and DL-RMI is the runner-up to our models. The performance of DL-RMI relies on the models at its upper levels. Although the neural networks at the upper levels discretize the output space into multiple regions, they tend to mispredict the cardinalities that are closest to the region boundaries. DL-DNN does not deliver good accuracy, and DL-DNNs_τ is even worse in some cases due to overfitting. This suggests that simply feeding deep neural networks with training data yields limited performance gain. DL-BiLSTM and DL-BiLSTM-A exhibit small MAPE on ED-AMiner, but are outperformed by DL-RMI in the other cases, suggesting that they do not learn the semantics of edit distance very well.

[Figure 5: Accuracy vs. threshold. (a) MSE, HM-ImageNet; (b) MAPE, HM-ImageNet; (c) MSE, ED-AMiner; (d) MAPE, ED-AMiner; (e) MSE, JC-BMS; (f) MAPE, JC-BMS; (g) MSE, EU-Glove300; (h) MAPE, EU-Glove300.]

In Figure 5, we evaluate the accuracy with varying thresholds on the four default datasets. We compare with the following models: DB-US, TL-XGB, DL-DLN, DL-MoE, and DL-RMI, i.e., the more accurate or monotonic models from each category. The general trend is that the errors increase with the threshold, meaning that larger thresholds are harder. The exceptions are MAPE on Hamming distance and MSE on edit distance. The reason is that the cardinalities at some large thresholds tend to resemble each other across different queries, and regression models are more likely to predict them well.

9.3 Estimation Efficiency
In Table 5, we show the average estimation time. We also report the time of running a state-of-the-art similarity selection algorithm [34, 64] to process queries and obtain the exact cardinality (referred to as SimSelect). The estimation time of CardNet is close to DL-RMI and faster than the database methods and TL-KDE. Thanks to the acceleration technique, CardNet-A becomes the runner-up, and its speed is close to the fastest model, DL-DNN. CardNet-A is faster than running the similarity selection algorithms by at least 24 times.
9.4 Evaluation of Model Components
We evaluate the following components in our models: feature extraction, incremental prediction, the variational auto-encoder (VAE), and the dynamic training strategy. We use the following ratio to measure the improvement brought by each component:

γ_ξ = ( ξ(CardNet{-A}−C) − ξ(CardNet{-A}) ) / ξ(CardNet{-A}−C),

where ξ ∈ {MAPE, MSE, mean q-error}, CardNet{-A} is our model CardNet or CardNet-A, and CardNet{-A}−C is CardNet{-A} with component C replaced by another option; e.g., CardNet{-A}−VAE is our model with the VAE replaced. A positive γ_ξ means the component has a positive effect on accuracy. We consider the following replacement options: (1) For feature extraction, we adopt a character-level bidirectional LSTM to transfer a string to a dense representation for edit distance, a deep set model [79] to transfer a set to its representation for Jaccard distance, and the original record as input for Euclidean distance; Hamming distance is not repeatedly tested, as we already use the original vectors as input. (2) For incremental prediction, we compare it with a deep neural network that takes as input the concatenation of x′ and the embedding of τ and outputs the cardinality. (3) For the VAE, we compare it with an option that directly concatenates the binary representation and the distance embeddings. (4) For dynamic training, we compare it with using only MSLE as the loss, i.e., removing the second term on the right-hand side of Equation 3. We report the γ_ξ values on the four default datasets in Table 6. The effects of the four components in our models are all positive, ranging from a 5% to 93% improvement in MSE, a 5% to 60% improvement in MAPE, and a 9% to 64% improvement in mean q-error. The most useful component is incremental prediction, with 38% to 93% performance improvement. This demonstrates that using incremental prediction on deep neural networks is significantly better than directly feeding neural networks with training data, in accord with what we have observed in Tables 3 and 4.

9.5 Number of Decoders
In Figure 6, we evaluate the accuracy by varying the number of decoders. In order to show the trend clearly, we use four datasets with large lengths or dimensionality, whose statistics are given in Table 7. As seen from the experimental results, using the largest τ_max setting does not always lead to the best performance; e.g., on HM-Youtube, the best performance is achieved when τ_max = 326 (327 decoders). When there are too few decoders, the feature extraction becomes lossy and cannot successfully capture the semantic information of the original distance function. As the number of decoders increases, the feature extraction becomes more effective at capturing the semantics. On the other hand, the performance drops if we use an excessive number of decoders. This is because, given a query, the cardinality only increases at a few thresholds (e.g., thresholds of 50 and 51 might produce the same cardinality). Using too many decoders introduces too many non-increasing points, posing difficulty in learning the regression model.

9.6 Model Size
Table 8 shows the storage sizes of the competitors on the four default datasets. DB-US does not need any storage and thus shows zero model size. TL-KDE has the smallest model size among the remaining non-deep models, because it only stores the kernel instances for estimation. Among the deep learning models, DL-DNN has the smallest model size.
Our model sizes range from 10 to 55 MB, smaller than those of the other deep models except DL-DNN.

9.7 Evaluation of Training
9.7.1 Training Time. Table 9 shows the training times of various models on the four default datasets. Traditional learning models are faster to train. Our models spend 2 to 4 hours, similar to the other deep models. DL-DNNs_τ is the slowest, since it has (τ_max + 1) independently learned deep neural networks.

Table 5: Average estimation time (milliseconds).
Model         HM-ImageNet  HM-PubChem  ED-AMiner  ED-DBLP  JC-BMS  JC-DBLP_q3  EU-Glove300  EU-Glove50
SimSelect     5.12         14.68       6.22       10.51    4.24    5.89        14.60        8.52
DB-SE         6.20         8.50        7.64       10.01    4.67    5.78        8.45         7.34
DB-US         1.17         3.60        1.26       6.08     1.75    1.44        6.23         1.05
TL-XGB        0.41         0.41        0.36       0.41     0.71    0.65        0.69         0.60
TL-LGBM       0.32         0.34        0.31       0.33     0.52    0.48        0.49         0.47
TL-KDE        0.83         0.96        4.73       1.24     0.97    2.35        1.28         1.22
DL-DLN        0.42         0.84        0.83       6.43     0.66    0.57        1.23         0.46
DL-MoE        0.21         0.32        0.35       0.59     0.31    0.36        0.41         0.28
DL-RMI        0.37         0.46        0.41       0.57     0.39    0.45        0.68         0.57
DL-DNN        0.09         0.11        0.15       0.25     0.11    0.14        0.15         0.12
DL-DNNs_τ     0.26         0.58        0.26       0.62     0.27    0.34        0.42         0.38
DL-BiLSTM     -            -           3.11       5.22     -       -           -            -
DL-BiLSTM-A   -            -           3.46       5.80     -       -           -            -
CardNet       0.36         0.45        0.39       0.69     0.55    0.48        0.67         0.50
CardNet-A     0.13         0.19        0.21       0.29     0.18    0.20        0.24         0.19

Table 6: Performance of model components (each cell reports γ for CardNet / CardNet-A).
Metric   Dataset       Feature Extraction  Incremental Prediction  VAE        Dynamic Training
γ_MSE    HM-ImageNet   - / -               84% / 86%               13% / 17%  20% / 28%
γ_MSE    ED-AMiner     49% / 44%           57% / 62%               11% / 14%  14% / 21%
γ_MSE    JC-BMS        31% / 34%           83% / 76%               34% / 26%  21% / 28%
γ_MSE    EU-Glove300   5% / 8%             93% / 82%               18% / 23%  26% / 17%
γ_MAPE   HM-ImageNet   - / -               47% / 46%               14% / 16%  15% / 16%
γ_MAPE   ED-AMiner     5% / 6%             52% / 51%               19% / 18%  18% / 22%
γ_MAPE   JC-BMS        26% / 16%           51% / 60%               40% / 29%  22% / 34%
γ_MAPE   EU-Glove300   32% / 21%           54% / 48%               23% / 14%  32% / 27%

Table 7: Statistics of datasets with high dimensionality.
Dataset       Source              Process                Data Type           Attribute          # Records  ℓ_max  ℓ_avg   Distance   θ_max
HM-Youtube    Youtube_Faces [7]   normalize              real-valued vector  video              346,194    1770   1770    Euclidean  0.8
HM-GIST2048   GIST [8]            Spectral Hashing [73]  binary vector       image              982,677    2048   2048    Hamming    512
ED-DBLP       [4]                 -                      string              publication title  1,000,000  199    72.49   edit       20
JC-Wikipedia  Wikipedia [9]       3-gram                 string              abstract           1,150,842  732    496.06  Jaccard    0.4

9.7.2 Varying the Size of Training Data. In Figure 7, we show the performance of different models when varying the scale of training examples from 20% to 100% of the original training data. We only plot MSE due to the page limit. All the models perform worse with less training data, but our models are more robust, showing only moderate accuracy loss.

9.8 Evaluation of Updates
We generate a stream of 200 update operations, each with an insertion or deletion of 5 records. We compare three methods: IncLearn, which utilizes incremental learning on CardNet-A; Retrain, which retrains CardNet-A for each operation; and +Sample,
which performs sampling (DB-US) on the updated data and adds the result to that of CardNet-A on the original data. Figure 8 plots the MSE on HM-ImageNet and EU-Glove300. We observe that in most cases, IncLearn has similar performance to Retrain and performs better than +Sample, especially when there are more updates. Compared to Retrain, which spends several hours retraining the model (Table 9), IncLearn only needs 1.2 to 1.5 minutes to perform incremental learning.

Table 8: Model size (MB).
Model       HM-ImageNet  ED-AMiner  JC-BMS  EU-Glove300
DB-SE       10.4         31.2       39.4    86.2
DB-US       0.6          0.5        0.6     3.4
TL-XGB      36.4         36.4       48.8    63.2
TL-LGBM     32.6         32.8       45.4    60.4
TL-KDE      4.5          1.5        3.6     18.1
DL-DLN      28.4         75.4       28.6    64.4
DL-MoE      16.8         52.5       35.4    52.5
DL-RMI      57.7         84.8       54.6    66.1
DL-DNN      5.0          14.5       8.7     9.8
DL-DNNs_τ   105.4        154.2      183.2   158.4
CardNet     9.6          40.2       16.4    23.8
CardNet-A   16.2         54.5       22.8    35.3

Table 9: Training time (hours).
Model       HM-ImageNet  ED-AMiner  JC-BMS  EU-Glove300
TL-XGB      0.8          0.8        1.0     1.2
TL-LGBM     0.6          0.5        0.7     0.6
TL-KDE      0.3          0.3        0.5     0.6
DL-DLN      4.1          4.6        4.5     4.9
DL-MoE      2.7          3.5        3.8     3.8
DL-RMI      3.2          3.7        3.9     4.4
DL-DNN      1.2          1.5        1.7     1.8
DL-DNNs_τ   15           12         12      20
CardNet     3.3          3.4        4.1     4.2
CardNet-A   1.7          2.2        2.4     2.6

Figure 6: Accuracy vs. number of decoders (MSE and MAPE on HM-Youtube, HM-GIST2048, ED-DBLP, and JC-Wikipedia).
Figure 7: Accuracy vs. training data size (MSE on the four default datasets).
Figure 8: Evaluation of updates (MSE on HM-ImageNet and EU-Glove300 over the sequence of operations).

9.9 Evaluation of Long-tail Queries
We compare the performance on long-tail queries, i.e., those having exceptionally large cardinalities (≥ 1000). They are outliers and a hard case for estimation. We divide the queries into different cardinality groups, one per thousand.
Figure 9 shows the MSE on the four default datasets for the different cardinality groups. The MSE increases with cardinality for all the methods. This is expected, since the larger the cardinality is, the more exceptional the query is (see Figure 1(b)). Our models outperform the others by 1 to 3 orders of magnitude. Moreover, the MSE growth rates of our models w.r.t. cardinality are smaller than those of the others, suggesting that our models are more robust against long-tail queries.

Figure 9: Evaluation of long-tail queries (MSE vs. cardinality range on the four default datasets).

9.10 Generalizability
To show the generalizability of our models, we evaluate the performance on queries that significantly differ from the records in the dataset and the training data. To prepare such queries, we first perform a k-medoids clustering on the dataset, then randomly generate 10,000 out-of-dataset queries and pick the top-2,000 ones having the largest sum of squared distances to the k centroids. To prepare an out-of-dataset query, we generate a random query q and accept it only if it is not in the dataset D. Specifically, (1) for binary vectors, q[i] ∼ uniform{0, 1}; (2) for strings, since AMiner and DBLP both contain author names, we take a random author name from the set (DBLP \ AMiner); (3) for sets, we generate a length l ∼ uniform[l_min, l_max], where l_min and l_max are the minimum and maximum set sizes in D, respectively, and then generate a random set of length l sampled from the universe of all the elements in D; (4) for real vectors, q[i] ∼ uniform[−1, 1].
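As a concrete illustration of this procedure, the following is a minimal sketch for the real-vector case (item (4) above), assuming NumPy. The k-medoids step is abstracted as a precomputed `medoids` array, and all names here are hypothetical rather than the actual experiment code.

```python
import numpy as np

def generate_out_of_dataset_queries(medoids, dataset, dim,
                                    n_random=10_000, n_keep=2_000, seed=0):
    """Generate hard out-of-dataset queries for the real-vector case."""
    rng = np.random.default_rng(seed)
    existing = {tuple(np.round(r, 6)) for r in dataset}   # reject queries already in D
    candidates = []
    while len(candidates) < n_random:
        q = rng.uniform(-1.0, 1.0, size=dim)               # q[i] ~ uniform[-1, 1]
        if tuple(np.round(q, 6)) not in existing:
            candidates.append(q)
    candidates = np.stack(candidates)
    # Sum of squared Euclidean distances from each candidate to the k medoids.
    sq_dists = ((candidates[:, None, :] - medoids[None, :, :]) ** 2).sum(axis=-1)
    scores = sq_dists.sum(axis=1)
    # Keep the candidates farthest from all clusters, i.e., the hardest queries.
    return candidates[np.argsort(-scores)[:n_keep]]
```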
Figure 10: Generalizability (MSE vs. cardinality range on the four default datasets).

Figure 10 shows the MSE on the four default datasets for the different cardinality groups. The same trend is witnessed as we have seen for long-tail queries. Due to the use of the VAE and dynamic training, our models always perform better than the other methods, especially for Jaccard distance. The results demonstrate that our models generalize well to out-of-dataset queries.

9.11 Performance in a Query Optimizer
9.11.1 Conjunctive Euclidean Distance Query. We consider a case study of conjunctive queries. Four textual datasets with multiple attributes are used (statistics shown in Table 10).

Table 10: Statistics of datasets for the conjunctive query optimizer.
Dataset             Source  Attributes                                         # Records  θ_min  θ_max
AMiner-Publication  [3]     title, authors, affiliations, venue, abstract      2,092,356  0.2    0.5
AMiner-Author       [3]     name, affiliations, research interests             1,712,433  0.2    0.5
IMDB-Movie          [10]    title type, primary title, original title, genres  6,250,486  0.2    0.5
IMDB-Actor          [10]    primary name, primary profession                   9,822,710  0.2    0.5

Given a dataset, we convert each attribute to a word embedding (768 dimensions) by Sentence-BERT [65]. A query is a conjunction of Euclidean distance predicates (a.k.a. high-dimensional range predicates [51]) on normalized word embeddings, with thresholds uniformly sampled from [0.2, 0.5]; e.g., "EU(name) ≤ 0.25 AND EU(affiliations) ≤ 0.4 AND EU(research interests) ≤ 0.45", where EU() measures the Euclidean distance between the embeddings of a query and a database record. Such queries can be used as blocking rules for entity matching [20, 28].

To process a query, we first find the records that satisfy one predicate by index lookup (with a cover tree [34]), and then check the other predicates on the fly. We estimate the cardinality of each predicate and pick the one with the smallest estimated cardinality for the index lookup. We compare CardNet-A with: (1) DB-US, with the sampling ratio tuned for the fastest query processing speed; (2) TL-XGB; (3) DL-RMI; (4) Mean, which returns the same cardinality for a given threshold; each threshold is quantized to an integer in [0, 255] using the threshold transformation in Section 4.4, and then we offline generate 10,000 random queries for each integer in [0, 255] and take the mean; and (5) Exact, an oracle that instantly returns the exact cardinality.

Figure 11: Conjunctive Euclidean distance query – query processing time (AMiner-Publication, AMiner-Author, IMDB-Movie, and IMDB-Actor).

Figure 11 reports the processing time of 1,000 queries. The time is broken down into cardinality estimation (in blue) and postprocessing (in red, including index lookup and on-the-fly checking). We observe: (1) more accurate cardinality estimation (as we have seen in Section 9.2) contributes to faster query processing; (2) cardinality estimation spends much less time than postprocessing; (3) uniform estimation (Mean) has the slowest overall speed; (4) deep learning performs better than database and traditional learning methods in both estimation and overall speed; (5) except for Exact, CardNet-A is the fastest and most accurate in estimation, and its overall speed is also the fastest (1.7 to 3.3 times faster than the runner-up, DL-RMI) and even close to Exact.
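For reference, the plan-selection strategy evaluated in this case study (estimate every predicate, use the most selective one for the index lookup, and verify the remaining predicates on the fly) can be sketched as follows. `estimator`, `index`, and the record layout are hypothetical interfaces standing in for the cardinality estimator, the cover-tree indexes, and the dataset; this is an illustrative sketch, not the actual implementation.

```python
import numpy as np

def answer_conjunctive_query(predicates, estimator, index, records):
    # predicates: list of (attribute, query_embedding, threshold) triples.
    # 1. Estimate the cardinality of every predicate.
    estimates = [estimator.estimate(attr, q, theta) for attr, q, theta in predicates]
    # 2. Use the most selective predicate for the index lookup (e.g., a cover tree).
    lead = int(np.argmin(estimates))
    attr0, q0, theta0 = predicates[lead]
    candidates = index[attr0].range_query(q0, theta0)
    # 3. Verify the remaining predicates on the fly.
    others = [p for i, p in enumerate(predicates) if i != lead]
    return [rid for rid in candidates
            if all(np.linalg.norm(records[rid][attr] - q) <= theta
                   for attr, q, theta in others)]
```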
Figure 12: Conjunctive Euclidean distance query – query planning precision (AMiner-Publication, AMiner-Author, IMDB-Movie, and IMDB-Actor).

In Figure 12, we show the precision of query planning, i.e., the percentage of queries on which a method picks the fastest plan (excluding estimation time). The result is in accord with what we have observed in Figure 11. The precision of CardNet-A ranges from 90% to 96%, second only to Exact. The gap between CardNet-A and DL-RMI is within 20%, but it results in a speedup of 1.7 to 3.3 times, showcasing the effect of correct query planning. We also observe that Exact is not 100% precise, though very close, indicating that the smallest cardinality does not always yield the best query plan. Future work on cost estimation may further improve query processing.

9.11.2 Hamming Distance Query. We also consider a case study of the GPH algorithm [63], which processes Hamming distance queries over high-dimensional vectors through a query optimizer. To cope with the high dimensionality, the algorithm answers a query q by dividing it into m non-overlapping parts and allocating a threshold (with dynamic programming) to each part using the pigeonhole principle. Each part itself is a Hamming distance selection query that can be answered by bit enumeration and index lookup. The union of the answers of the m parts forms the candidate set of q. To allocate thresholds, and hence to achieve a small query processing cost, a query optimizer is used to minimize the sum of the estimated cardinalities of the m parts. We compare CardNet-A with the following options: (1) Histogram, the histogram estimator in [63]; (2) DL-RMI; (3) Mean, a naive estimator that returns the same cardinality for a given threshold (for each threshold, we offline generate 10,000 random queries and take the mean); and (4) Exact, an oracle that instantly returns the exact cardinality. We use four datasets; their statistics are given in Table 11, where ℓ denotes the dimensionality. The records are converted to binary vectors as per the process listed in the table. We set each part to 32 bits (the last part is smaller if the dimensionality is not divisible).

Table 11: Statistics of datasets for the Hamming distance query optimizer.
Dataset      Source  Process                        Domain               # Records  ℓ    θ_max
HM-PubChem   [2]     -                              biological sequence  1,000,000  881  32
HM-UQVideo   [69]    multiple feature hashing [69]  video embedding      1,000,000  128  12
HM-fastText  [11]    spectral hashing [73]          word embedding       999,999    256  24
HM-EMNIST    [18]    image binarization             image pixel          814,255    784  32

Figure 13: Hamming distance query – query processing time (varying thresholds on HM-PubChem, HM-UQVideo, HM-fastText, and HM-EMNIST).

Figure 13 reports the processing time of 1,000 queries by varying thresholds from 8 to 32 on the four datasets. The time is broken down into threshold allocation time (in white, which contains cardinality estimation) and postprocessing time (in red). The performance of CardNet-A is very close to Exact, and it is faster than Histogram by 1.6 to 4.9 times. DL-RMI is slightly faster than Histogram. Mean is much slower (typically by one order of magnitude) than the other methods, suggesting that cardinality estimation is important for this application. The threshold allocation (including cardinality estimation) spends less time than the subsequent query processing.
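To make the allocation step concrete, below is a simplified sketch of pigeonhole-based threshold allocation by dynamic programming. It only captures the basic idea behind GPH [63]; the actual algorithm uses a more general pigeonhole principle and further optimizations. `cost` is a hypothetical table of estimated per-part cardinalities (e.g., produced by CardNet-A or a histogram), and all names are illustrative.

```python
def allocate_thresholds(cost, tau):
    # cost[i][t]: estimated cardinality of part i with threshold t (0 <= t <= tau).
    # A part may also get t = -1, meaning it produces no candidates. The pigeonhole
    # argument requires the per-part thresholds to sum to at least tau - m + 1.
    m = len(cost)
    budget = tau - m + 1
    dp = {0: (0.0, [])}          # dp[b] = (min total estimated cardinality, allocation)
    for i in range(m):
        nxt = {}
        for b, (c, alloc) in dp.items():
            for t in range(-1, tau + 1):
                nb = b + t
                nc = c + (cost[i][t] if t >= 0 else 0.0)
                if nb not in nxt or nc < nxt[nb][0]:
                    nxt[nb] = (nc, alloc + [t])
        dp = nxt
    feasible = [v for b, v in dp.items() if b >= budget]
    return min(feasible, key=lambda v: v[0])[1] if feasible else None
```

For example, `allocate_thresholds(cost, tau=16)` would return one threshold per part whose estimated total candidate count is minimal, which is exactly the quantity the query optimizer in this case study tries to minimize.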
CardNet-A also performs better than Histogram in threshold allocation. This is because (1) CardNet-A itself is faster in estimation, and (2) its higher accuracy makes the dynamic-programming-based allocation terminate earlier. Next, we fix the threshold at 50% of the maximum threshold in Figure 13 and vary the size of the histogram. Figure 14 reports the average query processing time; the positions of the other methods (except Exact) are also marked in the figure. As expected, the query processing time decreases when using larger histograms. However, the speed is still 1.6 to 2.6 times slower than CardNet-A even when the size of the histogram exceeds the model size of CardNet-A (see the rightmost point of Histogram). This result showcases the superiority of our model over the traditional database method in an application of cardinality estimation of similarity selection.

Figure 14: Hamming distance query – varying model size (query processing time vs. histogram size on HM-PubChem, HM-UQVideo, HM-fastText, and HM-EMNIST).

10 CONCLUSION
We investigated utilizing deep learning for cardinality estimation of similarity selection. Observing the challenges of this problem and the advantages of using deep learning, we designed a method composed of two components. The feature extraction component transforms the original data and threshold to a Hamming space, hence supporting any data types and distance functions. The regression component estimates the cardinality in the Hamming space based on a deep learning model. We exploited the incremental property of cardinality to output monotonic results and devised an encoder and a set of decoders that estimate the cardinality for each distance value. We developed a training strategy tailored to our model and proposed optimization techniques to speed up estimation. We also discussed incremental learning for updates. The experimental results demonstrated the accuracy, efficiency, and generalizability of the proposed method, as well as the effectiveness of integrating our method into a query optimizer.

ACKNOWLEDGMENTS
This work was supported by JSPS 16H01722, 17H06099, 18H04093, and 19K11979, NSFC 61702409, CCF DBIR2019001A, NKRDP of China 2018YFB1003201, ARC DE190100663, DP170103710, and DP180103411, and D2D CRC DC25002 and DC25003. The Titan V was donated by Nvidia. We thank Rui Zhang (the University of Melbourne) for his precious comments.

REFERENCES
[1] http://www.image-net.org/.
[2] https://pubchem.ncbi.nlm.nih.gov/.
[3] https://aminer.org/.
[4] https://dblp2.uni-trier.de/.
[5] https://www.kdd.org/kdd-cup/view/kdd-cup-2000.
[6] https://nlp.stanford.edu/projects/glove/.
[7] http://www.cs.tau.ac.il/~wolf/ytfaces/index.html.
[8] http://horatio.cs.nyu.edu/mit/tiny/data/index.html.
[9] https://wiki.dbpedia.org/services-resources/documentation/datasets.
[10] https://www.imdb.com/interfaces/.
[11] https://fasttext.cc/docs/en/english-vectors.html.
[12] C. Anagnostopoulos and P. Triantafillou.
Query-driven learning for predictive analytics of data subspace cardinality. ACM Transactions on Knowledge Discovery from Data, 11(4):47, 2017.
[13] Y. Bengio, A. C. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[14] W. Cai, M. Balazinska, and D. Suciu. Pessimistic cardinality estimation: Tighter upper bounds for intermediate join cardinalities. In SIGMOD, pages 18–35, 2019.
[15] Z. Cao, M. Long, J. Wang, and S. Y. Philip. HashNet: Deep learning to hash by continuation. In ICCV, pages 5609–5618, 2017.
[16] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In KDD, pages 785–794, 2016.
[17] J. P. Chiu and E. Nichols. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370, 2016.
[18] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik. EMNIST: Extending MNIST to handwritten letters. In IJCNN, pages 2921–2926, 2017.
[19] H. Daniels and M. Velikova. Monotone and partially monotone neural networks. IEEE Transactions on Neural Networks, 21(6):906–917, 2010.
[20] S. Das, P. S. G. C., A. Doan, J. F. Naughton, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, and Y. Park. Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services. In SIGMOD, pages 1431–1446, 2017.
[21] A. Dasgupta, X. Jin, B. Jewell, N. Zhang, and G. Das. Unbiased estimation of size and other aggregates over hidden web databases. In SIGMOD, pages 855–866, 2010.
[22] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SOCG, pages 253–262, 2004.
[23] B. Ding, S. Das, R. Marcus, W. Wu, S. Chaudhuri, and V. R. Narasayya. AI meets AI: Leveraging query executions to improve index recommendations. In SIGMOD, pages 1241–1258, 2019.
[24] A. Dutt, C. Wang, A. Nazi, S. Kandula, V. R. Narasayya, and S. Chaudhuri. Selectivity estimation for range predicates using lightweight models. PVLDB, 12(9):1044–1057, 2019.
[25] M. M. Fard, K. Canini, A. Cotter, J. Pfeifer, and M. Gupta. Fast and flexible monotonic functions with ensembles of lattices. In NIPS, pages 2919–2927, 2016.
[26] E. Garcia and M. Gupta. Lattice regression. In NIPS, pages 594–602, 2009.
[27] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518–529, 1999.
[28] C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. W. Shavlik, and X. Zhu. Corleone: Hands-off crowdsourcing for entity matching. In SIGMOD, pages 601–612, 2014.
[29] M. Gupta, A. Cotter, J. Pfeifer, K. Voevodski, K. Canini, A. Mangylov, W. Moczydlowski, and A. Van Esbroeck. Monotonic calibrated interpolated look-up tables. The Journal of Machine Learning Research, 17(1):3790–3836, 2016.
[30] P. J. Haas and A. N. Swami. Sequential Sampling Procedures for Query Size Estimation, volume 21. ACM, 1992.
[31] S. Hasan, S. Thirumuruganathan, J. Augustine, N. Koudas, and G. Das. Multi-attribute selectivity estimation using deep learning. CoRR, abs/1903.09999, 2019.
[32] M. Heimel, M. Kiefer, and V. Markl. Self-tuning, GPU-accelerated kernel density models for multidimensional selectivity estimation. In SIGMOD, pages 1477–1492, 2015.
[33] O. Ivanov and S. Bartunov. Adaptive cardinality estimation. arXiv preprint arXiv:1711.08330, 2017.
[34] M. Izbicki and C. R. Shelton. Faster cover trees.
In ICML, pages 1162–1170, 2015.
[35] H. Jiang. Uniform convergence rates for kernel density estimation. In ICML, pages 1694–1703, 2017.
[36] L. Jin, C. Li, and R. Vernica. SEPIA: Estimating selectivities of approximate string predicates in large databases. The VLDB Journal, 17(5):1213–1229, 2008.
[37] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. B. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017.
[38] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu. LightGBM: A highly efficient gradient boosting decision tree. In NIPS, pages 3149–3157, 2017.
[39] R. Kemker, M. McClure, A. Abitino, T. L. Hayes, and C. Kanan. Measuring catastrophic forgetting in neural networks. In AAAI, pages 3390–3398, 2018.
[40] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[41] A. Kipf, T. Kipf, B. Radke, V. Leis, P. A. Boncz, and A. Kemper. Learned cardinalities: Estimating correlated joins with deep learning. In CIDR, 2019.
[42] T. Kraska, M. Alizadeh, A. Beutel, E. H. Chi, A. Kristo, G. Leclerc, S. Madden, H. Mao, and V. Nathan. SageDB: A learned database system. In CIDR, 2019.
[43] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In SIGMOD, pages 489–504, 2018.
[44] S. Krishnan, Z. Yang, K. Goldberg, J. Hellerstein, and I. Stoica. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196, 2018.
[45] H. Lee, R. T. Ng, and K. Shim. Extending q-grams to estimate selectivity of string matching with low edit distance. In VLDB, pages 195–206, 2007.
[46] H. Lee, R. T. Ng, and K. Shim. Power-law based estimation of set similarity join size. PVLDB, 2(1):658–669, 2009.
[47] H. Lee, R. T. Ng, and K. Shim. Similarity join size estimation using locality sensitive hashing. PVLDB, 4(6):338–349, 2011.
[48] V. Leis, B. Radke, A. Gubichev, A. Kemper, and T. Neumann. Cardinality estimation done right: Index-based join sampling. In CIDR, 2017.
[49] G. Li, J. He, D. Deng, and J. Li. Efficient similarity join and search on multi-attribute data. In SIGMOD, pages 1137–1151, 2015.
[50] P. Li and C. König. b-bit minwise hashing. In WWW, pages 671–680, 2010.
[51] K. Lin, H. V. Jagadish, and C. Faloutsos. The TV-tree: An index structure for high-dimensional data. The VLDB Journal, 3(4):517–542, 1994.
[52] R. J. Lipton and J. F. Naughton. Query size estimation by adaptive sampling. In PODS, pages 40–46, 1990.
[53] H. Liu, M. Xu, Z. Yu, V. Corvinelli, and C. Zuzarte. Cardinality estimation using neural networks. In CSSE, pages 53–59, 2015.
[54] R. Marcus and O. Papaemmanouil. Deep reinforcement learning for join order enumeration. In aiDM@SIGMOD, pages 3:1–3:4, 2018.
[55] R. C. Marcus, P. Negi, H. Mao, C. Zhang, M. Alizadeh, T. Kraska, O. Papaemmanouil, and N. Tatbul. Neo: A learned query optimizer. PVLDB, 12(11):1705–1718, 2019.
[56] R. C. Marcus and O. Papaemmanouil. Plan-structured deep neural network models for query performance prediction. PVLDB, 12(11):1733–1746, 2019.
[57] M. Mattig, T. Fober, C. Beilschmidt, and B. Seeger. Kernel-based cardinality estimation on metric data. In EDBT, pages 349–360, 2018.
[58] A. Mazeika, M. H. Böhlen, N. Koudas, and D. Srivastava.
Estimating the selectivity of approximate string queries. ACM Transactions on Database Systems, 32(2):12, 2007.
[59] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989.
[60] J. Ortiz, M. Balazinska, J. Gehrke, and S. S. Keerthi. Learning state representations for query optimization with deep reinforcement learning. In DEEM@SIGMOD, pages 4:1–4:4, 2018.
[61] J. Ortiz, M. Balazinska, J. Gehrke, and S. S. Keerthi. An empirical analysis of deep learning for cardinality estimation. CoRR, abs/1905.06425, 2019.
[62] H. Park and L. Stefanski. Relative-error prediction. Statistics & Probability Letters, 40(3):227–236, 1998.
[63] J. Qin, Y. Wang, C. Xiao, W. Wang, X. Lin, and Y. Ishikawa. GPH: Similarity search in Hamming space. In ICDE, pages 29–40, 2018.
[64] J. Qin and C. Xiao. Pigeonring: A principle for faster thresholded similarity search. PVLDB, 12(1):28–42, 2018.
[65] N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In EMNLP-IJCNLP, pages 3980–3990, 2019.
[66] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[67] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint, 2017.
[68] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In NIPS, pages 3483–3491, 2015.
[69] J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo. Effective multiple feature hashing for large-scale near-duplicate video retrieval. IEEE Transactions on Multimedia, 15(8):1997–2008, 2013.
[70] J. Sun and G. Li. An end-to-end learning-based cost estimator. PVLDB, 13(3):307–319, 2019.
[71] I. Trummer. Exact cardinality query optimization with bounded execution cost. In SIGMOD, pages 2–17, 2019.
[72] M. Tschannen, O. Bachem, and M. Lucic. Recent advances in autoencoder-based representation learning. CoRR, abs/1812.05069, 2018.
[73] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, pages 1753–1760, 2009.
[74] L. Woltmann, C. Hartmann, M. Thiele, D. Habich, and W. Lehner. Cardinality estimation with local deep learning models. In aiDM@SIGMOD, pages 5:1–5:8, 2019.
[75] W. Wu, J. F. Naughton, and H. Singh. Sampling-based query re-optimization. In SIGMOD, pages 1721–1736, 2016.
[76] X. Wu, M. Charikar, and V. Natchu. Local density estimation in high dimensions. In ICML, pages 5293–5301, 2018.
[77] Z. Yang, E. Liang, A. Kamsetty, C. Wu, Y. Duan, X. Chen, P. Abbeel, J. M. Hellerstein, S. Krishnan, and I. Stoica. Selectivity estimation with deep likelihood models. CoRR, abs/1905.04278, 2019.
[78] S. You, D. Ding, K. Canini, J. Pfeifer, and M. Gupta. Deep lattice networks and partial monotonic functions. In NIPS, pages 2981–2989, 2017.
[79] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In NIPS, pages 3391–3401, 2017.
[80] H. Zhang and Q. Zhang. EmbedJoin: Efficient edit similarity joins via embeddings. In KDD, pages 585–594, 2017.
[81] J. Zhang, Y. Liu, K. Zhou, G. Li, Z. Xiao, B. Cheng, J. Xing, Y. Wang, T. Cheng, L. Liu, M.
Ran, and Z. Li. An end-to-end automatic cloud database tuning system using deep reinforcement learning. In SIGMOD, pages 415–432, 2019.
[82] W. Zhang, K. Gao, Y. Zhang, and J. Li. Efficient approximate nearest neighbor search with integrated binary codes. In ACM Multimedia, pages 1189–1192, 2011.
[83] Z. Zhao, R. Christensen, F. Li, X. Hu, and K. Yi. Random sampling over joins revisited. In SIGMOD, pages 1525–1539, 2018.
