A Subsequence-Histogram Method for Generic Vocabulary Recognition over Deletion Channels


We consider the problem of recognizing a vocabulary, i.e., a collection of words (sequences) over a finite alphabet, from a potential subsequence of one of its words. We assume the given subsequence is received through a deletion channel as a result of transmitting a random word from one of two generic underlying vocabularies. An exact maximum a posteriori (MAP) solution to this problem counts the number of ways a given subsequence can be derived from particular subsets of candidate vocabularies, requiring exponential time or space. We present a polynomial-time approximation algorithm for this problem. The algorithm makes no prior assumption about the rules and patterns governing the structure of the vocabularies. Instead, through offline processing of the vocabularies, it extracts data about regularity patterns in the subsequences of each vocabulary. In the recognition phase, the algorithm uses only this data, called a subsequence histogram, to decide in favor of one of the vocabularies. We provide examples demonstrating the performance of the algorithm and show that it can match MAP performance in some situations. Potential applications include bioinformatics, storage systems, and search engines.


💡 Research Summary

The paper tackles a fundamental inference problem that arises when a string drawn from one of two candidate vocabularies is transmitted through a deletion channel and only a subsequence of the original string is observed. Formally, let Σ be a finite alphabet and let V₁ and V₂ be two finite sets of words over Σ. A word w is first sampled uniformly (or according to a known prior) from one of the vocabularies, then each symbol of w is independently deleted with probability p, producing a subsequence s that preserves the original order of the retained symbols. Given s, the task is to decide whether the hidden word originated from V₁ or from V₂, i.e., to compute the posterior probabilities P(V₁|s) and P(V₂|s) and select the larger one (the MAP decision).
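
The channel model and the MAP rule described above can be sketched directly. The following is a minimal illustration, not the paper's algorithm: it simulates the deletion channel, computes the per-word likelihood P(s|w) = N(s,w)·pⁿ⁻ᵐ·(1−p)ᵐ by brute-force enumeration of deletion patterns (where N(s,w) counts the embeddings of s in w), and makes the MAP decision under uniform priors. All function names are illustrative.

```python
import itertools
import random

def deletion_channel(word, p, rng=random.Random(0)):
    # Each symbol is independently deleted with probability p; the
    # retained symbols keep their original order.
    return "".join(c for c in word if rng.random() >= p)

def embedding_count(s, w):
    # Brute force: count index sets of len(s) positions in w that spell s.
    return sum(1 for idx in itertools.combinations(range(len(w)), len(s))
               if "".join(w[i] for i in idx) == s)

def likelihood(s, w, p):
    # P(s | w) = N(s, w) * p^(n - m) * (1 - p)^m  for n = len(w), m = len(s)
    if len(s) > len(w):
        return 0.0
    return embedding_count(s, w) * p ** (len(w) - len(s)) * (1 - p) ** len(s)

def map_decide(s, V1, V2, p):
    # Uniform prior over words within each vocabulary and over vocabularies;
    # returns the index of the vocabulary with the larger posterior.
    L1 = sum(likelihood(s, w, p) for w in V1) / len(V1)
    L2 = sum(likelihood(s, w, p) for w in V2) / len(V2)
    return 1 if L1 >= L2 else 2
```

Because `embedding_count` enumerates all C(n, m) deletion patterns, this sketch is only feasible for toy word lengths; it serves to make the decision rule concrete.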

The naïve MAP solution requires evaluating P(s|V) = Σ_{w∈V} P(s|w)·P(w) for each vocabulary. The term P(s|w) is proportional to the number of distinct deletion patterns (embeddings) that turn w into s, which is combinatorial: for a word of length n and a subsequence of length m, there are C(n,m) candidate patterns. Consequently, exact MAP inference has exponential time and space complexity in the length of the words, rendering it impractical for realistic vocabularies (e.g., genomic databases, large text corpora, or storage-system dictionaries).
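
As an aside, the per-word embedding count N(s,w) itself need not be enumerated pattern by pattern: it admits the standard "distinct subsequences" dynamic program in O(nm) time. A minimal sketch (not taken from the paper; the blow-up the authors address concerns aggregating such terms over large generic vocabularies, not this single-word count):

```python
def embedding_count_dp(s, w):
    # dp[j] = number of ways s[:j] embeds as a subsequence of the prefix
    # of w scanned so far.  The empty subsequence embeds exactly once.
    m = len(s)
    dp = [1] + [0] * m
    for c in w:
        # Scan j downward so each character of w fills at most one slot.
        for j in range(m, 0, -1):
            if s[j - 1] == c:
                dp[j] += dp[j - 1]
    return dp[m]
```

This runs in O(len(w)·len(s)) time and O(len(s)) space, versus the C(n,m) enumeration of deletion patterns.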

To overcome this bottleneck, the authors propose a subsequence‑histogram approach that separates an offline preprocessing phase from an online recognition phase. During preprocessing, each vocabulary V is scanned exhaustively: for every word w∈V, all of its subsequences of every possible length are generated, and a count is kept for each distinct subsequence t. The resulting data structure H_V is a multi‑dimensional histogram indexed by subsequence length k and the subsequence string t.
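
The two phases can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the histogram H_V records, for each subsequence string t, the total number of its embeddings across all words of V, and recognition here uses a simple lookup comparison (the paper's actual decision statistic may differ). Function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_histogram(V):
    # Offline phase: H[t] = total number of embeddings of subsequence t
    # across all words of vocabulary V.  Enumerating every subsequence is
    # exponential in word length, so this sketch is only for toy sizes.
    H = Counter()
    for w in V:
        for k in range(len(w) + 1):
            for idx in combinations(range(len(w)), k):
                H["".join(w[i] for i in idx)] += 1
    return H

def recognize(s, H1, H2):
    # Online phase: decide using only the precomputed histograms.
    return 1 if H1[s] >= H2[s] else 2
```

The key point is the separation of concerns: the expensive combinatorial work happens once per vocabulary offline, and the online decision reduces to histogram lookups.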

