Convolutional Embedding for Edit Distance
Conclusions
In this paper, we proposed CNN-ED, a model that uses a convolutional neural network (CNN) to embed edit distance into Euclidean distance. A complete pipeline (including input preparation, loss function and sampling method) is formulated to train the model end to end, and theoretical analysis is conducted to justify choosing CNN as the model structure. Extensive experimental results show that CNN-ED outperforms existing edit distance embedding methods in terms of both accuracy and efficiency. Moreover, CNN-ED shows promising performance for edit distance similarity search and is robust to different hyper-parameter configurations. We believe that incorporating CNN embeddings to design efficient string similarity search frameworks is a promising future direction.
Acknowledgments. We thank the reviewers for their valuable comments. This work was partially supported by GRF 14208318 from the RGC and ITF 6904945 from the ITC of HKSAR.
Introduction
Given two strings $`s_x`$ and $`s_y`$, their edit distance $`\ED(s_x, s_y)`$ is the minimum number of edit operations (i.e., insertion, deletion and substitution) required to transform $`s_x`$ into $`s_y`$ (or $`s_y`$ into $`s_x`$). As a metric, edit distance is widely used to evaluate the similarity between strings. Edit-distance-based string similarity search has many important applications including spelling correction, data de-duplication, entity linking and sequence alignment .
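For concreteness, this definition is commonly computed with the textbook dynamic program (not the sub-quadratic algorithm discussed below); a minimal sketch:

```python
def edit_distance(sx, sy):
    """Dynamic-programming edit distance: the minimum number of
    insertions, deletions and substitutions turning sx into sy."""
    m, n = len(sx), len(sy)
    prev = list(range(n + 1))  # distances from the empty prefix of sx
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                        # deletion
                         cur[j - 1] + 1,                     # insertion
                         prev[j - 1] + (sx[i - 1] != sy[j - 1]))  # substitution
        prev = cur
    return prev[n]
```

For example, `edit_distance("kitten", "sitting")` returns 3 (two substitutions and one insertion), and the symmetry of the operations makes the measure a metric.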
The high computational complexity of edit distance is the main obstacle for string similarity search, especially for large datasets with long strings. For two strings with length $`l`$, computing their edit distance has $`\mathcal{O}(l^2/\log(l))`$ time complexity using the best algorithm known so far . There is evidence that this complexity cannot be improved further . Pruning-based solutions have been used to avoid unnecessary edit distance computation . However, it is reported that pruning-based solutions are inefficient when a string and its most similar neighbor have a large edit distance , which is common for datasets with long strings.
Metric embedding has been shown to be successful in bypassing distances with high computational complexity (e.g., Wasserstein distance ). For edit distance, a metric embedding model can be defined by an embedding function $`f(\cdot)`$ and a distance measure $`d(\cdot, \cdot)`$ such that the distance in the embedding space approximates the true edit distance, i.e., $`\ED(s_x, s_y) \!\approx\! d\left(f(s_x), f(s_y)\right)`$. A small approximation error ($`|\ED(s_x, s_y)-d\left(f(s_x), f(s_y)\right)|`$) is crucial for metric embedding. For similarity search applications, we also want the embedding to preserve the order of edit distance. That is, for a triplet of strings, $`s_x`$, $`s_y`$ and $`s_z`$, with $`\ED(s_x, s_y)<\ED(s_x, s_z)`$, it should ensure that $`d\left(f(s_x), f(s_y)\right)\!<\!d\left(f(s_x), f(s_z)\right)`$. In this paper, we evaluate the accuracy of the embedding methods using both approximation error and order preserving ability.
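These two accuracy criteria can be made concrete with a small evaluation sketch. Here `ed` (an exact edit distance function) and `f` (the embedding) are placeholders for whatever implementations are being compared, and Euclidean distance plays the role of $`d(\cdot,\cdot)`$:

```python
import itertools
import numpy as np

def evaluate_embedding(strings, ed, f):
    """Return (mean approximation error, fraction of order-preserving
    triplets) for embedding f under Euclidean distance.
    ed and f are caller-supplied placeholders."""
    emb = {s: np.asarray(f(s), dtype=float) for s in strings}
    d = lambda a, b: np.linalg.norm(emb[a] - emb[b])
    # approximation error over all string pairs
    errs = [abs(ed(a, b) - d(a, b))
            for a, b in itertools.combinations(strings, 2)]
    # order preservation over all triplets with a strict edit-distance order
    ok = total = 0
    for x, y, z in itertools.permutations(strings, 3):
        if ed(x, y) < ed(x, z):
            total += 1
            ok += d(x, y) < d(x, z)
    return np.mean(errs), ok / total if total else 1.0
```

As a sanity check, a perfect embedding (e.g., mapping each string to its length when the distance is the length difference) yields zero error and full order preservation.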
Several methods have been proposed for edit distance embedding. Ostrovsky and Rabani embed edit distance into $`\ell_1`$ with a distortion¹ of $`2^{\mathcal{O}(\sqrt{\log l \log \log l})}`$ but the algorithm is too complex for practical implementation. The CGK algorithm embeds edit distance into Hamming distance and the distortion is $`\mathcal{O}(\ED)`$ , in which $`\ED`$ is the true edit distance. CGK is simple to implement and has been shown to be effective when incorporated into a string similarity search pipeline. Both Ostrovsky and Rabani’s method and CGK are data-independent, while learning-based methods can provide better embeddings by considering the structure of the underlying dataset. GRU trains a recurrent neural network (RNN) to embed edit distance into Euclidean distance. Although GRU outperforms CGK, its RNN structure makes training and inference inefficient. Moreover, its output vector (i.e., $`f(s_x)`$) has a high dimension, which results in complicated distance computation and high memory consumption. As our main baseline methods, we discuss CGK and GRU in more detail in Section 4.
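As an illustration of how CGK maps strings to fixed-length sequences compared under Hamming distance, the random walk it performs can be sketched as follows. The output length $`3N`$ follows the original formulation; the padding symbol and the shared random matrix `R` are implementation choices here, not part of the cited analysis:

```python
import numpy as np

def cgk_embed(s, alphabet, R, N, pad='#'):
    """One CGK-style random walk: at step j, copy the current character
    and advance the pointer by a random bit depending on (j, character).
    R is a {0,1} matrix of shape (3N, |alphabet|), shared across strings."""
    idx = {c: k for k, c in enumerate(alphabet)}
    out, i = [], 0
    for j in range(3 * N):
        if i < len(s):
            out.append(s[i])
            i += int(R[j, idx[s[i]]])  # advance pointer with probability 1/2
        else:
            out.append(pad)  # right-pad once the walk has consumed the string
    return ''.join(out)

def hamming(a, b):
    return sum(ca != cb for ca, cb in zip(a, b))
```

Under the same `R`, identical strings map to identical outputs (Hamming distance 0); the cited analysis bounds how far apart the walks of similar strings can drift.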
To tackle the problems of GRU, we propose CNN-ED, which embeds edit distance into Euclidean distance using a convolutional neural network (CNN). The CNN structure allows more efficient training and inference than RNN, and we constrain the output vector to have a relatively short length (e.g., 128). The loss function is a weighted combination of the triplet loss and the approximation error, which enforces accurate edit distance approximation and preserves the order of edit distance at the same time. We also conduct theoretical analysis to justify our choice of CNN as the model structure, which shows that the operations in CNN preserve edit distance to some extent. In contrast, similar analytical results are not known for RNN. Indeed, we observed that for some datasets a randomly initialized CNN (without any training) already provides better embeddings than CGK and a fully trained GRU.
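The combined loss described above can be sketched on a single triplet as follows. The weight `alpha` and the margin choice $`\ED(s_x,s_z)-\ED(s_x,s_y)`$ are assumptions for illustration, not necessarily the exact formulation used in the paper:

```python
import numpy as np

def embedding_loss(fx, fy, fz, ed_xy, ed_xz, alpha=0.1):
    """Sketch of a weighted combination of approximation error and
    triplet loss. fx/fy/fz are embeddings of anchor, closer string,
    and farther string, with true edit distances ed_xy < ed_xz."""
    d_xy = np.linalg.norm(fx - fy)
    d_xz = np.linalg.norm(fx - fz)
    # approximation term: embedded distances should match true edit distances
    approx = abs(d_xy - ed_xy) + abs(d_xz - ed_xz)
    # triplet term: keep the closer string closer, by an edit-distance margin
    triplet = max(0.0, d_xy - d_xz + (ed_xz - ed_xy))
    return approx + alpha * triplet
```

When the Euclidean distances match the true edit distances exactly, both terms vanish, so a perfect embedding incurs zero loss.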
We conducted extensive experiments on 5 datasets with various cardinalities and string lengths. The results show that CNN-ED outperforms both CGK and GRU in approximation accuracy, computation efficiency, and memory consumption. The approximation error of CNN-ED can be only 50% of GRU’s even though CNN-ED uses an output vector that is two orders of magnitude shorter. For training and inference, the speedup of CNN-ED over GRU is up to 30x and 200x, respectively. Using the embeddings for string similarity join, CNN-ED outperforms EmbedJoin , a state-of-the-art method. For threshold-based string similarity search, CNN-ED reaches a recall of 0.9 up to 200x faster than HSsearch . Moreover, CNN-ED is shown to be robust to hyper-parameters such as output dimension and the number of layers.
To summarize, we made three contributions in this paper. First, we propose a CNN-based pipeline for edit distance embedding, which outperforms existing methods by a large margin. Second, theoretical evidence is provided for using CNN as the model for edit distance embedding. Third, extensive experiments are conducted to validate the performance of the proposed method.
The rest of the paper is organized as follows. Section 4 introduces the background of string similarity search and two edit distance embedding algorithms, i.e., CGK and GRU. Section 5 presents our CNN-based pipeline and conducts theoretical analysis to justify using CNN as the model. Section 6 provides experimental results on the accuracy, efficiency, robustness and similarity search performance of the CNN embedding. The concluding remarks are given in Section 7.
---
¹ An embedding method is said to have a distortion of $`\gamma`$ if there exists a positive scaling constant $`\lambda`$ such that $`\lambda \ED(s_x, s_y) \!\le\!d\left(f(s_x), f(s_y)\right)\!\le \!\gamma \lambda \ED(s_x, s_y)`$ .