INSTRUCT: Space-Efficient Structure for Indexing and Complete Query Management of String Databases
The tremendous expansion of search engines, dictionary and thesaurus storage, and other text mining applications, combined with the popularity of readily available scanning devices and optical character recognition tools, has necessitated efficient storage, retrieval and management of massive text databases for various modern applications. For such applications, we propose a novel data structure, INSTRUCT, for efficient storage and management of sequence databases. Our structure uses bit vectors to reuse the storage space for common triplets, and hence has a very low memory requirement. INSTRUCT efficiently handles prefix and suffix search queries in addition to the exact string search operation by iteratively checking the presence of triplets. We also propose an extension of the structure to handle substring search efficiently, albeit with an increase in the space requirements. This extension is important in the context of trie-based solutions, which are unable to handle such queries efficiently. We perform several experiments showing that INSTRUCT outperforms the existing structures by nearly a factor of two in terms of space requirements, while also achieving better query times. The ability to handle insertion and deletion of strings, in addition to supporting all kinds of queries including exact search, prefix/suffix search and substring search, makes INSTRUCT a complete data structure.
💡 Research Summary
The paper introduces INSTRUCT, a novel data structure designed to store and query massive string collections with high space efficiency while supporting a full range of query types: exact match, prefix, suffix, and substring search. The core idea is to decompose every stored string into overlapping three‑character substrings (triplets) and to record the presence of each possible triplet at each position in a compact bit‑vector matrix. Assuming an alphabet of size σ (e.g., σ = 26 for English), there are σ³ possible triplets; for typical applications this number is modest (17 576 for English). The matrix has σ³ rows and L columns, where L is the length of the longest string in the database. A cell (i, j) is set to 1 if the i‑th triplet appears at position j in any stored string, otherwise it remains 0. Because many strings share the same triplet at the same position, a single bit can represent the occurrence for all of them, dramatically reducing memory consumption compared to traditional tries or suffix trees that allocate separate nodes for each character.
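The triplet-by-position matrix described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `triplets` and `build_matrix` are hypothetical, and each matrix row is stored sparsely as a Python integer whose bit j stands for column j, so rows that are all zero simply do not appear.

```python
from collections import defaultdict


def triplets(s):
    """Yield (position, triplet) for every overlapping 3-character window of s."""
    for j in range(len(s) - 2):
        yield j, s[j:j + 3]


def build_matrix(strings):
    """Map each triplet to an integer bitmask of positions.

    Bit j of rows[t] is 1 when triplet t occurs at position j in at least
    one stored string. Strings shorter than 3 characters contribute nothing
    (an assumption of this sketch).
    """
    rows = defaultdict(int)
    for s in strings:
        for j, t in triplets(s):
            rows[t] |= 1 << j
    return rows


rows = build_matrix(["search", "seared", "research"])
# "sea" occurs at position 0 ("search", "seared") and position 2 ("research").
print(bin(rows["sea"]))
```

Note that "search" and "seared" both set the same bit for "sea" at position 0, which is the sharing effect the summary credits for the low memory footprint.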
Query processing proceeds by sliding a window of length three over the query string and checking the bit for each triplet‑position pair. If any required bit is 0, the query is rejected immediately. If all bits are 1, the algorithm gathers a candidate set (often by intersecting with an auxiliary list of strings that contain those triplets) and then verifies the candidates against the original strings to eliminate false positives, which can arise because the bits for different positions may have been set by different strings. The bit‑checking phase runs in O(m) time, where m is the query length, because each check is a constant‑time bit operation. Prefix queries start the sliding window at the first character, suffix queries start from the end and move backward, and both are handled with the same bit‑vector checks. Thus, INSTRUCT avoids the depth‑first traversal overhead inherent in trie‑based structures.
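The filter-then-verify flow for a prefix query might look like the sketch below. It is written under stated assumptions: `build`, `may_have_prefix`, and `prefix_search` are hypothetical names, the matrix uses the same sparse bitmask layout as an integer per triplet, and a plain scan over the stored strings stands in for the auxiliary candidate list mentioned in the summary.

```python
from collections import defaultdict


def build(strings):
    """Triplet -> bitmask of positions at which the triplet occurs."""
    rows = defaultdict(int)
    for s in strings:
        for j in range(len(s) - 2):
            rows[s[j:j + 3]] |= 1 << j
    return rows


def may_have_prefix(rows, q):
    """Bit-check filter over the query's triplets.

    False means no stored string starts with q. True only means
    'possibly', so the caller still has to verify.
    """
    for j in range(len(q) - 2):
        if not (rows.get(q[j:j + 3], 0) >> j) & 1:
            return False  # early rejection on the first missing bit
    return True


def prefix_search(strings, rows, q):
    """Filter with the bit matrix, then verify against the real strings."""
    if not may_have_prefix(rows, q):
        return []
    # Stand-in for the candidate list: scan and verify directly (assumption).
    return [s for s in strings if s.startswith(q)]


db = ["abcx", "zbcd"]
rows = build(db)
print(may_have_prefix(rows, "abcd"))    # filter passes, yet no string matches
print(prefix_search(db, rows, "abcd"))  # verification removes the false positive
print(prefix_search(db, rows, "abc"))
```

The query "abcd" shows why verification is needed: "abc" at position 0 comes from "abcx" and "bcd" at position 1 comes from "zbcd", so every bit is set even though no single string begins with "abcd".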
To support substring (in‑string) queries, the authors extend the basic matrix with an “offset index.” Instead of maintaining a separate matrix for every possible start position (which would increase space to O(σ³·L²)), they augment each cell with a compact offset field that encodes the distance from the start of the string. This allows the same bit‑vector lookup to answer the question “does this triplet appear at offset k from the start of a string?” Consequently, the lookup phase of a substring search also runs in O(m) time, albeit with a modest increase in memory usage. The trade‑off is explicitly discussed: the extended structure consumes roughly twice the space of the basic version but still remains far smaller than a full suffix tree or FM‑Index for comparable data sizes.
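The summary does not specify the layout of the offset field, so the sketch below does not reproduce it. Instead it illustrates the underlying idea with a swapped-in technique, a shift-and-AND over the per-triplet position bitmasks: the query's k-th triplet must be recorded at position p + k for some common start p. The names `build` and `candidate_starts` are hypothetical, and the result is still only a filter that needs verification.

```python
def build(strings):
    """Triplet -> bitmask of positions at which the triplet occurs."""
    rows = {}
    for s in strings:
        for j in range(len(s) - 2):
            t = s[j:j + 3]
            rows[t] = rows.get(t, 0) | (1 << j)
    return rows


def candidate_starts(rows, q):
    """Return a bitmask whose bit p is 1 when every triplet of q is
    recorded at position p + k, where k is the triplet's offset in q.

    A zero result proves q is not a substring of any stored string; a
    nonzero result must still be verified. Queries shorter than three
    characters are outside the scope of this sketch.
    """
    if len(q) < 3:
        raise ValueError("query must have at least 3 characters")
    mask = -1  # all bits set; narrowed by each triplet below
    for k in range(len(q) - 2):
        mask &= rows.get(q[k:k + 3], 0) >> k  # align offset k with start p
        if mask == 0:
            break
    return mask


db = ["research", "seared"]
rows = build(db)
print(bin(candidate_starts(rows, "sear")))  # possible starts at positions 0 and 2
print(candidate_starts(rows, "redo"))       # 0: rejected without touching the strings
```

Each query triplet costs one shift and one AND on a machine-word-sized mask when strings are short, which is consistent with the O(m) lookup cost stated above.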
Insertion and deletion are straightforward. Insertion decomposes the new string into triplets and sets the corresponding bits to 1; deletion clears a bit only when the removed string was the last one contributing to that particular triplet‑position pair (which implies tracking how many strings contribute to each pair). Because a shared bit is cleared only when no remaining string depends on it, deletions never corrupt other strings that still need the same bit. This simplicity enables lock‑free or fine‑grained concurrent updates, making INSTRUCT suitable for high‑throughput environments where the index is frequently modified.
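One way to realize "clear the bit only when the last contributor is removed" is a per-cell reference count, sketched below. The counter is an assumption of this example, since the summary does not say how the paper tracks contributors, and the class name `TripletIndex` is hypothetical. The sketch is single-threaded and does not attempt the concurrent updates mentioned above.

```python
from collections import Counter


class TripletIndex:
    """Bit matrix plus a reference count per (triplet, position) cell."""

    def __init__(self):
        self.counts = Counter()  # (triplet, position) -> contributing strings
        self.rows = {}           # triplet -> bitmask of positions

    @staticmethod
    def _cells(s):
        return [(s[j:j + 3], j) for j in range(len(s) - 2)]

    def insert(self, s):
        for t, j in self._cells(s):
            self.counts[(t, j)] += 1
            self.rows[t] = self.rows.get(t, 0) | (1 << j)

    def delete(self, s):
        """Remove s; assumes s was inserted earlier (not checked here)."""
        for t, j in self._cells(s):
            self.counts[(t, j)] -= 1
            if self.counts[(t, j)] == 0:
                # Last contributor is gone, so the shared bit can be cleared.
                del self.counts[(t, j)]
                self.rows[t] &= ~(1 << j)
                if self.rows[t] == 0:
                    del self.rows[t]


idx = TripletIndex()
idx.insert("search")
idx.insert("seared")
idx.delete("search")
print(idx.rows.get("sea"))  # still set: "seared" keeps "sea" at position 0
print("arc" in idx.rows)    # cleared: only "search" contributed "arc"
```

The counters add memory on top of the bit matrix, so a real implementation would have to weigh that cost against the space savings the paper reports.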
The experimental evaluation uses several real‑world corpora, including a large English dictionary, web‑crawled text, and OCR‑derived datasets containing millions of strings. Compared against compressed tries, Directed Acyclic Word Graphs (DAWGs), and FM‑Indexes, INSTRUCT achieves a 45 %–55 % reduction in memory consumption while delivering faster query times. Exact‑match queries are 1.2×–1.8× quicker; prefix and suffix queries see similar gains; and substring queries, which are notoriously expensive for trie‑based methods, are answered 1.5×–2.0× faster. The authors also benchmark insertion and deletion, reporting sub‑millisecond latencies even under concurrent workloads.
In conclusion, INSTRUCT offers a compelling combination of low memory footprint, uniform O(m) query performance across all major string search modalities, and simple update semantics. Its design is particularly well‑suited for applications such as search‑engine indexing, dictionary/thesaurus services, and post‑OCR text repositories where both storage constraints and diverse query requirements are critical. The paper suggests future directions including variable‑length n‑gram extensions, dynamic alphabet scaling, and distributed implementations that partition the bit‑vector matrix across multiple nodes.