String Indexing for Patterns with Wildcards

📝 Original Info

  • Title: String Indexing for Patterns with Wildcards
  • ArXiv ID: 1110.5236
  • Date: 2009-06-01
  • Authors: A. Amir, G. M. Landau, S. Muthukrishnan

📝 Abstract

We consider the problem of indexing a string $t$ of length $n$ to report the occurrences of a query pattern $p$ containing $m$ characters and $j$ wildcards. Let $occ$ be the number of occurrences of $p$ in $t$, and $\sigma$ the size of the alphabet. We obtain the following results.

• A linear space index with query time $O(m+\sigma^j \log \log n + occ)$. This significantly improves the previously best known linear space index by Lam et al. [ISAAC 2007], which requires query time $\Theta(jn)$ in the worst case.

• An index with query time $O(m+j+occ)$ using space $O(\sigma^{k^2} n \log^k \log n)$, where $k$ is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time.

• A time-space trade-off, generalizing the index by Cole et al. [STOC 2004].

We also show that these indexes can be generalized to allow variable length gaps in the pattern. Our results are obtained using a novel combination of well-known and new techniques, which could be of independent interest.

📄 Full Content

The string indexing problem is to build an index for a string t such that the occurrences of a query pattern p can be reported. The classic suffix tree data structure [38] combined with perfect hashing [15] gives a linear space solution for string indexing with optimal query time, i.e., an O(n) space data structure that supports queries in O(m + occ) time, where occ is the number of occurrences of p in t.
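To make the classic problem concrete, here is a minimal sketch of an exact-match index. It swaps in a simple suffix array with binary search rather than the suffix tree with perfect hashing cited above, so it does not achieve the O(m + occ) query time; the function names `build_suffix_array` and `find` are illustrative assumptions.

```python
def build_suffix_array(t):
    """Return the start positions of all suffixes of t in lexicographic order.

    Naive construction for illustration only; sorting full suffixes is far
    from the linear-time constructions used in practice.
    """
    return sorted(range(len(t)), key=lambda i: t[i:])


def find(t, sa, p):
    """Report all start positions of p in t using binary search over sa."""
    m = len(p)

    # Lower bound: first suffix whose length-m prefix is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


t = "banana"
sa = build_suffix_array(t)
print(find(t, sa, "ana"))  # [1, 3]
```

The index is built once and then reused for many patterns, which is what distinguishes indexing from scanning the text at query time.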

Recently, various extensions of the classic string indexing problem that allow errors or wildcards (also known as gaps or don’t cares) have been studied [6,11,24,28,32,36,37]. In this paper, we focus on one of the most basic of these extensions, namely, string indexing for patterns with wildcards. In this problem, only the pattern contains wildcards, and the goal is to report all occurrences of p in t, where a wildcard is allowed to match any character in t.
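The matching semantics described above can be pinned down with a brute-force reference matcher. This is only a sketch that scans the text without any index; the wildcard symbol `*` and the function names are assumptions made for illustration.

```python
from typing import List

WILDCARD = "*"  # assumed wildcard symbol; the text does not fix a concrete character


def matches_at(t: str, p: str, i: int) -> bool:
    """Return True if pattern p matches text t starting at position i.

    A wildcard in p matches any single character of t.
    """
    if i + len(p) > len(t):
        return False
    return all(pc == WILDCARD or pc == t[i + k] for k, pc in enumerate(p))


def occurrences(t: str, p: str) -> List[int]:
    """Report every start position where p occurs in t (no index, O(n * |p|) time)."""
    return [i for i in range(len(t) - len(p) + 1) if matches_at(t, p, i)]


print(occurrences("banana", "a*a"))  # [1, 3]
```

Any index for this problem should report exactly the positions that this scan reports, which makes it a convenient correctness check.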

String indexing for patterns with wildcards has several natural applications in large-scale data processing areas such as information retrieval, bioinformatics, data mining, and internet traffic analysis. For instance, in bioinformatics, the PROSITE database [5,21] supports searching for protein patterns containing wildcards. Despite significant interest in the problem and its many variations, most of the basic questions remain unsolved.

We introduce three new indexes and obtain several new bounds for string indexing with wildcards in the pattern. If the index can handle patterns containing an unbounded number of wildcards, we call it an unbounded wildcard index; otherwise, we refer to it as a k-bounded wildcard index, where k is the maximum number of wildcards allowed in p. Let n be the length of the indexed string t, and let σ be the size of the alphabet. We define m and j to be the number of characters and wildcards in p, respectively, so the length of p is m + j. We show the following:

• There is an unbounded wildcard index with query time $O(m + \sigma^j \log \log n + occ)$ using linear space. This significantly improves the previously best known linear space index by Lam et al. [24], which requires query time Θ(jn) in the worst case. Compared to the index by Cole et al. [11] having the same query time, we improve the space usage by a factor log n.

• There is a k-bounded wildcard index with query time O(m+j+occ) using space $O(\sigma^{k^2} n \log^k \log n)$. This is the first non-trivial space bound with this query time.

• There is a time-space trade-off for k-bounded wildcard indexes. This trade-off generalizes the index described by Cole et al. [11].

Furthermore, we generalize these indexes to support variable length gaps in the pattern.
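To get a rough feel for the gap between the first bound and the Θ(jn) worst case of the earlier linear space index, the following sketch evaluates the two wildcard-dependent terms for one set of parameters. It ignores all constant factors hidden by the O-notation, assumes base-2 logarithms, and uses hypothetical function names and example values, so the numbers are only indicative.

```python
import math


def wildcard_term_linear_space(sigma, j, n):
    """sigma^j * log log n: the wildcard-dependent term of the linear space index."""
    return sigma ** j * math.log2(math.log2(n))


def lam_et_al_worst_case(j, n):
    """j * n: the worst-case query cost quoted for the index by Lam et al."""
    return j * n


# Example values chosen for illustration (a DNA-sized alphabet, three wildcards).
sigma, j, n = 4, 3, 2 ** 16
print(wildcard_term_linear_space(sigma, j, n))  # 64 * log2(16) = 256.0
print(lam_et_al_worst_case(j, n))               # 3 * 65536 = 196608
```

The first term grows exponentially in j, so the advantage holds for patterns with few wildcards over small alphabets and shrinks quickly as j increases.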

Exact string matching has been generalized with error bounds in many different ways. In particular, allowing matches within a bounded Hamming or edit distance, known as approximate string matching, has been the subject of much research [2, 6, 10-12, 19, 25, 26, 28, 32, 35, 37]. Another generalization was suggested by Fischer and Paterson [14], allowing wildcards in the text or pattern. Work on the wildcard problem has mostly focused on the non-indexing variant, where the string t is not preprocessed in advance [4,8,9,13,14,23]. Some solutions to the indexing problem consider the case where wildcards appear only in the indexed string [36] or in both the string and the pattern [11,24].

In the following, we summarize the known indexes that support wildcards in the pattern only. We focus on the case where k > 1, since for k = 0 the problem is classic string indexing. For k = 1, Cole et al. [11] describe a selection of specialized solutions. However, these solutions do not generalize to larger k.

Several simple solutions to the problem exist for k > 1. Using a suffix tree T for t [38], we can find all occurrences of p in a top-down traversal starting from the root. When we reach a wildcard character in p in location ℓ ∈ T, the search branches out, consuming the first character on all outgoing edges from ℓ. This gives an unbounded wildcard index using O(n) space with query time $O(\sigma^j m + occ)$, where occ is the total number of occurrences of p in t. Alternatively, we can build a compressed trie storing all possible modifications of all suffixes of t containing at most k wildcards. This gives a k-bounded wildcard index using $O(n^{k+1})$ space with query time O(m + j + occ).
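The branching traversal can be sketched as follows. For brevity this uses an uncompressed suffix trie that stores start positions at every node, which takes quadratic space rather than the O(n) space of a real suffix tree; the wildcard symbol `*`, the `Node` class, and the function names are illustrative assumptions.

```python
WILDCARD = "*"  # assumed wildcard symbol


class Node:
    def __init__(self):
        self.children = {}  # character -> child Node
        self.starts = []    # start positions of the suffixes passing through this node


def build_suffix_trie(t):
    """Insert every suffix of t into an uncompressed trie (quadratic space)."""
    root = Node()
    for i in range(len(t)):
        node = root
        for c in t[i:]:
            node = node.children.setdefault(c, Node())
            node.starts.append(i)
    return root


def search(root, p):
    """Top-down traversal for a non-empty pattern p.

    An ordinary character follows one child; a wildcard branches into all
    children, so the set of active locations can grow by a factor of up to
    the alphabet size per wildcard.
    """
    frontier = [root]
    for pc in p:
        next_frontier = []
        for node in frontier:
            if pc == WILDCARD:
                next_frontier.extend(node.children.values())
            elif pc in node.children:
                next_frontier.append(node.children[pc])
        frontier = next_frontier
        if not frontier:
            return []
    return sorted(i for node in frontier for i in node.starts)


root = build_suffix_trie("banana")
print(search(root, "a*a"))  # [1, 3]
```

Each wildcard can multiply the number of active locations by up to σ, which is where the σ^j factor in the query time of this simple approach comes from.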

In 2004, Cole et al. [11] gave an elegant k-bounded wildcard index using $O(n \log^k n)$ space and with $O(m + 2^j \log \log n + occ)$ query time. For sufficiently small values of j this significantly improves the previous bounds. The key components in this solution are a new data structure for longest common prefix (LCP) queries and a heavy path decomposition [20] of the suffix tree for the text t. Given a pattern p, the LCP data structure supports efficient insertion of all suffixes of p into the suffix tree for t, such that subseq

Table 1: † = presented in this paper. The term $occ(p_i, t)$ denotes the number of matches of $p_i$ in t and is Θ(n) in the worst case.

Reference

This content is AI-processed based on open access ArXiv data.
