The leaf node is identical to that of the $`B^{ed}`$-tree, where each entry contains a string value. The maximum number of strings that can reside in a leaf node is

\begin{equation}
f_1=\frac{P}{|s|},
\label{eq:f1}
\end{equation}

where $`P`$ is the page size, and $`|s|`$ is the maximum size of a string value.
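As a quick illustration of the leaf fan-out formula, the following minimal sketch computes $`f_1`$ for a given page size. The function name `leaf_fanout` and the sample values (a 4096-byte page, 64-byte strings) are assumptions chosen for illustration, and the integer division reflects the assumption that a node stores only whole entries.

```python
def leaf_fanout(page_size: int, max_string_size: int) -> int:
    """Return f_1 = P / |s|, rounded down to a whole number of entries.

    page_size       -- P, the page size in bytes (illustrative value below)
    max_string_size -- |s|, the maximum size of a string value in bytes
    """
    if page_size <= 0 or max_string_size <= 0:
        raise ValueError("page size and string size must be positive")
    return page_size // max_string_size


# Hypothetical configuration: 4096-byte pages and strings of at most 64 bytes.
print(leaf_fanout(4096, 64))
```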

The internal node’s entry is of the form $`(s^b, p, h, s^e)`$, where $`s^b`$ and $`s^e`$ are the first and last strings in the subtree rooted at the child node, following the $`B^{ed}`$-tree’s string ordering scheme, and $`p`$ points to the corresponding child node. We say that all leaf nodes are at level 0. An internal node can have at most $`f_2`$ (Equation [eq:f2]) child nodes. If the internal node’s level is 1, $`h=Hash(s_1||s_2||\dots||s_{f_1})`$, where $`s_1, \dots, s_{f_1}`$ are the strings in the child leaf node. Otherwise, $`h=Hash((s^b_1, h_1, s^e_1)||\dots||(s^b_{f_2}, h_{f_2}, s^e_{f_2}))`$, which is generated from the entries of its child internal node. The client takes the hash value of the root node as the signature of the $`MB^{ed}`$-tree. Figure 2 gives an example to illustrate the structure of an $`MB^{ed}`$-tree.

\begin{equation}
f_2=\frac{P}{2|s|+|h|+|p|},
\label{eq:f2}
\end{equation}

where $`|h|`$ is the length of a hash value, and $`|p|`$ is the size of a pointer.

Figure 2: The structure of the $`MB^{ed}`$-tree.
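The bottom-up hash construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: SHA-256 stands in for the unspecified $`Hash`$ function, strings are assumed to arrive already sorted in the $`B^{ed}`$-tree order, `||` is taken to be plain byte concatenation, the pointer $`p`$ is omitted because it does not enter the hash, and the names `Entry`, `leaf_entry`, `internal_entry`, and `internal_fanout` are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import List


def H(data: bytes) -> bytes:
    # Assumption: SHA-256 plays the role of Hash(); the text does not fix one.
    return hashlib.sha256(data).digest()


@dataclass
class Entry:
    s_b: str   # first string in the child's subtree
    h: bytes   # hash value covering the child node
    s_e: str   # last string in the child's subtree


def leaf_entry(strings: List[str]) -> Entry:
    """Level-1 entry for a leaf: h = Hash(s_1 || ... || s_f1)."""
    digest = H(b"".join(s.encode("utf-8") for s in strings))
    return Entry(strings[0], digest, strings[-1])


def internal_entry(children: List[Entry]) -> Entry:
    """Higher-level entry: h = Hash((s^b_1, h_1, s^e_1) || ...)."""
    payload = b"".join(
        c.s_b.encode("utf-8") + c.h + c.s_e.encode("utf-8") for c in children
    )
    return Entry(children[0].s_b, H(payload), children[-1].s_e)


def internal_fanout(P: int, s: int, h: int, p: int) -> int:
    """f_2 = P / (2|s| + |h| + |p|), rounded down to whole entries."""
    return P // (2 * s + h + p)


# Two hypothetical leaf nodes, already in sorted order.
leaves = [["abc", "abd"], ["bcd", "bce"]]
root = internal_entry([leaf_entry(leaf) for leaf in leaves])
signature = root.h  # the value the client would keep as the tree's signature
```

Because every hash covers the hashes beneath it, changing any single string in a leaf changes the root hash, which is what lets the client detect tampered values by recomputing the signature from the returned data.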

Introduction

Big data analytics offers the promise of providing valuable insights. However, many companies, especially small- and medium-sized organizations, lack the computational resources, in-house knowledge, and experience needed for big data analytics. A practical solution to this dilemma is outsourcing, where the data owner outsources the data to a computationally powerful third-party service provider (e.g., the cloud) for cost-effective data storage, processing, and analysis.

In this paper, we consider string similarity search, an important data analytics operation that has been used in a broad range of applications, as the outsourced computation. Generally speaking, the data owner outsources a string database $`D`$ to a third-party service provider (server). The server provides the storage and processing of similarity search queries as services. The search queries ask for the strings in $`D`$ that are similar to a number of given strings, where the similarity is measured by a specific similarity function and a user-defined threshold.

For all the benefits of outsourcing and cloud computing, the outsourcing paradigm deprives the data owner of direct control over her data. This poses numerous security challenges, one of which is that the server may cheat on the similarity search results. For example, the server is incentivized to improve its revenue by computing with fewer resources (e.g., searching only a portion of $`D`$) while charging for more. Therefore, it is important to authenticate whether the service provider has performed the search faithfully and returned the correct results to the client. A naive method is to execute the search queries locally and compare the results with the outcome from the server. This method is prohibitively costly, as it requires the client to repeat the very computation that was outsourced. We aim to design efficient methods that enable the client to authenticate that the server returned sound and complete similar strings. By soundness we mean that the returned strings are indeed similar. By completeness we mean that all similar strings are returned. In this paper, we focus on edit distance, a commonly used string similarity function.
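To make the notions of soundness and completeness concrete, the naive local check mentioned above can be sketched as follows. This is an illustration only: the function names `edit_distance` and `naive_verify` are hypothetical, the edit distance is the standard dynamic-programming (Levenshtein) formulation, and two strings are assumed to be similar when their edit distance is at most the threshold `theta`.

```python
from typing import Iterable, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def naive_verify(D: List[str], q: str, theta: int,
                 returned: Iterable[str]) -> Tuple[bool, bool]:
    """Re-run the search over all of D and compare with the server's answer.

    Returns (sound, complete). Assumes s is similar to q iff
    edit_distance(s, q) <= theta.
    """
    expected = {s for s in D if edit_distance(s, q) <= theta}
    got = set(returned)
    sound = got <= expected      # every returned string is in D and similar
    complete = expected <= got   # no similar string was left out
    return sound, complete
```

The check computes one edit distance per string in $`D`$, which is exactly the workload the client wanted to outsource; this is the cost that an authentication mechanism aims to avoid.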

Most existing work solves the authentication problem for spatial queries in the Euclidean space. To the best of our knowledge, ours is the first to consider the authentication of outsourced string similarity search. Intuitively, the strings can be mapped to the Euclidean space via a similarity-preserving embedding function. However, such embedding functions cannot guarantee 100% precision (i.e., the embedded points of some dissimilar strings become similar in the Euclidean space). This prevents the direct use of the existing Euclidean distance based authentication approaches on string similarity queries.

In this paper, we design $`AutoS^3`$, an Authentication mechanism of Outsourced String Similarity Search. The key idea of $`AutoS^3`$ is that besides returning the similar strings, the server returns a verification object ($`VO`$) that can prove the soundness and completeness of returned strings. In particular, we make the following contributions.

First, we design an authentication tree structure named MB-tree. The MB-tree is constructed by integrating the Merkle hash tree, a widely used authenticated data structure, with the $`B^{ed}`$-tree, a compact index for efficient string similarity search based on edit distance.

Second, we design the basic verification method named $`VS^2`$ for search queries that consist of a single query string. $`VS^2`$ constructs the $`VO`$ from the MB-tree, which requires including false hits in the $`VO`$, where false hits refer to the strings that are not returned in the result but are necessary for authenticating the result. We prove that $`VS^2`$ is able to catch the server’s cheating behaviors, such as tampered values, soundness violations, and completeness violations.

A large number of false hits can impose a significant verification burden on the client. Therefore, our third contribution is the design of the E-$`VS^2`$ algorithm, which reduces the $`VO`$ verification cost at the client side. E-$`VS^2`$ applies a similarity-preserving embedding function to map strings to the Euclidean space in such a way that similar strings are mapped to close Euclidean points. The $`VO`$ is then constructed from both the MB-tree and the embedded Euclidean space. Compared with $`VS^2`$, E-$`VS^2`$ dramatically reduces the verification cost by replacing a large number of expensive string edit distance calculations with a small number of cheap Euclidean distance computations.

Fourth, we extend our approach to the authentication of (1) similarity search queries that consist of multiple query strings, and (2) top-k similarity search. We design efficient optimization methods that reduce the verification cost in both cases.

Last but not least, we complement the theoretical investigation with an extensive set of experiments on real datasets. The experimental results demonstrate the efficiency of our approaches. They show that E-$`VS^2`$ can save 25% of the verification cost of the $`VS^2`$ approach.

The rest of the paper is organized as follows. Sections 18 and 13 discuss the related work and preliminaries. Section 14 formally defines the problem. Section 20 presents our $`VS^2`$ and E-$`VS^2`$ approaches for single-string search queries. Section 19 discusses the authentication of multi-string search queries. Section 15 extends to top-k similarity search. The experiment results are shown in Section 16. Section 12 concludes the paper.

The leaf node entry in the $`MB^{ed}`$-tree is of the form $`(s, p, h)`$, where $`s`$ is the string value, $`p`$ is the pointer to the disk block that stores $`s`$, and $`h=Hash(s)`$. The maximum number of entries in a leaf node is $`f_1=\frac{P}{|p|+|s|+|h|}`$. The internal node’s entry is $`(s^b, p, h, s^e)`$, where $`s^b`$ and $`s^e`$ are the first and last strings in the subtree rooted at the child node, following the $`B^{ed}`$-tree’s string ordering scheme, and $`p`$ points to the corresponding child node. An internal node can have at most $`f_2=\frac{P}{|p|+|h|+2|s|}`$ child nodes. The hash value in an entry is $`h=Hash(s^b||h_1||\dots||h_{f_2}||s^e)`$. Let $`N_R`$ denote the root node. The client computes the hash of the concatenation of $`s^b`$, $`s^e`$, and the hash value of every entry in $`N_R`$, and takes it as the $`MB^{ed}`$-tree’s signature. An example of the $`MB^{ed}`$-tree structure is presented in Figure 2.
