Languages of lossless seeds

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Several algorithms for similarity search employ seeding techniques to quickly discard very dissimilar regions. In this paper, we study theoretical properties of lossless seeds, i.e., spaced seeds having full sensitivity. We prove that lossless seeds coincide with languages of certain sofic subshifts, hence they can be recognized by finite automata. Moreover, we show that these subshifts are fully given by the number of allowed errors k and the seed margin l. We also show that for a fixed k, optimal seeds must asymptotically satisfy l ~ m^(k/(k+1)).

💡 Research Summary

The paper investigates lossless seeds—spaced seed patterns that guarantee full sensitivity for approximate string matching under a Hamming distance bound k. The authors recast the seed concept in the language of formal language theory and symbolic dynamics. A seed Q is a finite word over the alphabet {#, –}, where ‘#’ denotes a required match and ‘–’ a wildcard. The weight of Q is the number of # symbols, and the seed margin ℓ is defined as ℓ = m – |Q|, where m is the length of the strings being compared. A seed solves the (m, k) problem if, for every set of k error positions in a length‑m string, there exists a shift t ∈ {0,…,ℓ} such that none of the # positions of Q fall on an error.
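The definition above can be checked directly by brute force for small parameters. The following is a minimal sketch, not the paper's method: the function name `solves_problem` is hypothetical, the seed is passed as a string in which `#` marks a required match and every other character is treated as a wildcard, and the enumeration over all error sets is exponential in k, so it is only practical for short strings.

```python
from itertools import combinations


def solves_problem(seed: str, m: int, k: int) -> bool:
    """Return True if `seed` solves the (m, k) problem as defined above.

    Hypothetical helper for illustration: '#' is a required match,
    any other character is a wildcard.
    """
    margin = m - len(seed)  # seed margin: l = m - |Q|
    if margin < 0:
        return False  # the seed does not fit into a length-m string

    # Offsets inside the seed that must land on matching positions.
    required = [i for i, symbol in enumerate(seed) if symbol == "#"]

    # Enumerate every set of exactly k error positions in a length-m string.
    for errors in combinations(range(m), k):
        error_set = set(errors)
        # The seed must avoid all errors for at least one shift t in {0, ..., l}.
        hit_free_shift_exists = any(
            all(t + i not in error_set for i in required)
            for t in range(margin + 1)
        )
        if not hit_free_shift_exists:
            return False
    return True


if __name__ == "__main__":
    print(solves_problem("##", 5, 1))  # True: one error always leaves two adjacent matches
    print(solves_problem("##", 5, 2))  # False: errors at positions 1 and 3 block every shift
```

For a contiguous seed of weight w, this check agrees with the pigeonhole argument: k errors split the string into k + 1 error-free runs, so a run of length w is guaranteed exactly when m ≥ (k + 1)·w.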

The central technical contribution is Theorem 1, which translates the detection condition into a statement about the logical OR (⊕) of k shifted copies of an infinite word w that embeds Q surrounded by infinite – symbols. Specifically, Q detects an error set {i₁,…,i_k} at shift t iff the OR of the shifted copies σ_{i₁}(w),…,σ_{i_k}(w) contains only – symbols on the interval
