Unique decodability of bigram counts by finite automata


📝 Original Info

  • Title: Unique decodability of bigram counts by finite automata
  • ArXiv ID: 1111.6431
  • Date: 2023-11-15
  • Authors: M. L. Lia, X. Xie

📝 Abstract

We revisit the problem of deciding, by means of a finite automaton, whether a given string is uniquely decodable from its bigram counts. We give an efficient algorithm for constructing a polynomial-size nondeterministic finite automaton that decides unique decodability. Conversely, we show that the minimum deterministic finite automaton for deciding unique decodability has at least exponentially many states in the alphabet size.

📄 Full Content

Reconstructing a string from its snippets is a problem of fundamental importance in many areas of computing. In a biological context this problem amounts to sequencing of DNA from short reads [6] and reconstruction of protein sequences from K-peptides [9]. Communications protocols [3,8] recombine snippets from related documents to identify differences between them, and fuzzy extractors [10] use similar techniques for producing keys from noise-prone biometric data. Computational linguistics also makes occasional use of this snippet representation (under the name Wickelfeatures [1]), as a means to learn transformations on varying-length sequences.

In general, there may be a large number of possible string reconstructions from a given collection of overlapping snippets; for example, the snippets {at, an, ka, na, ta} can be combined into katana or kanata. To keep the decoding complexity and ambiguity low, it is desirable in practice to choose a snippet length that allows only a few distinct reconstructions, the ideal number being exactly one.

Main results. We consider the problem of efficiently determining whether a collection of snippets has a unique reconstruction. More precisely, we construct a nondeterministic finite automaton (NFA) on O(|Σ|^3) states that recognizes precisely those strings over the alphabet Σ that have a unique reconstruction. Our NFA has a particularly simple form that provides for an easy and efficient implementation, and runs on a string of length ℓ in time O(ℓ|Σ|^3) and constant memory. We further show that the minimum equivalent deterministic finite automaton has at least 2^{|Σ|-1} states. This lower bound is still far off from the upper bound 2^{O(|Σ| log |Σ|)} implicit in [11], and closing this gap is an intriguing open problem.
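To illustrate why running an NFA needs memory that does not grow with the input length ℓ, here is a minimal sketch of the standard on-the-fly simulation, which keeps only the set of currently active states. The NFA used here is a hypothetical toy (it accepts strings over {a, b} whose second-to-last symbol is a); it is not the automaton constructed in the paper, and the function name `nfa_accepts` is an assumption made for this example.

```python
def nfa_accepts(delta, start, finals, w):
    """Simulate an NFA on w by tracking the set of active states.

    delta maps (state, symbol) to an iterable of successor states.
    The active set never exceeds the number of states, so the memory
    used is independent of the length of w.
    """
    active = {start}
    for c in w:
        # Follow every transition on c out of every active state.
        active = {r for q in active for r in delta.get((q, c), ())}
    return bool(active & finals)


# Toy NFA (illustrative only): accept if the second-to-last symbol is 'a'.
toy_delta = {
    (0, "a"): (0, 1),  # guess that this 'a' is second-to-last, or keep waiting
    (0, "b"): (0,),
    (1, "a"): (2,),
    (1, "b"): (2,),
}
toy_finals = {2}

print(nfa_accepts(toy_delta, 0, toy_finals, "aab"))  # True
print(nfa_accepts(toy_delta, 0, toy_finals, "ba"))   # False
```

Each input symbol costs time proportional to the number of transitions leaving the active states, so the total running time is linear in ℓ with a factor depending only on the automaton.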

It was shown in [7] that the collection of strings having a unique reconstruction from the snippet representation is a regular language. An explicit construction of a deterministic finite-state automaton (DFA) recognizing this language was given by Lia and Xie [11]. Unfortunately, this DFA has exponentially many states in the alphabet size (the upper bound 2^{O(|Σ| log |Σ|)} is implicit in [11]), and thus is not practical except for very small alphabets. As we show in this paper, there is no DFA of subexponential size for recognizing this language; however, we exhibit an equivalent NFA with O(|Σ|^3) states.

Outline. We proceed in Section 2 with some preliminary definitions and notation. In Section 3 we present our construction of an NFA recognizing uniquely decodable strings, and we prove its correctness in Section 4. Finally, we present a new lower bound on the size of a DFA accepting uniquely decodable strings in Section 5, and conclude in Section 6 with discussion and an open problem.

We assume a finite alphabet Σ along with a special delimiter character $ ∉ Σ, and define Σ_$ = Σ ∪ {$}. For k ≥ 1, the k-gram map Φ takes a string x ∈ $Σ*$ to a vector ξ ∈ N^{Σ_$^k}, where ξ_{i_1,...,i_k} ∈ N is the number of times the string i_1 ... i_k ∈ Σ_$^k occurs in x as a contiguous subsequence, counting overlaps. As we have seen, the bigram map Φ : $Σ*$ → N^{Σ_$^2} is not injective; for example, Φ($katana$) = Φ($kanata$).
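The bigram map can be sketched in a few lines of Python. This is an illustrative implementation only: the function name `bigram_counts` and the use of `collections.Counter` as a sparse stand-in for the count vector ξ are assumptions made for this example.

```python
from collections import Counter


def bigram_counts(w, delimiter="$"):
    """Return the bigram counts of delimiter + w + delimiter.

    The Counter is a sparse representation of the vector xi:
    bigrams that never occur are simply absent (count 0).
    """
    x = delimiter + w + delimiter
    # Slide a window of width 2 over x; overlapping occurrences all count.
    return Counter(x[i:i + 2] for i in range(len(x) - 1))


# The map is not injective: both strings yield the same counts.
print(bigram_counts("katana") == bigram_counts("kanata"))  # True
# Overlaps are counted: "$aaa$" contains "aa" twice.
print(bigram_counts("aaa")["aa"])  # 2
```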

We denote by L_UNIQ ⊆ Σ* the collection of all strings w for which

Φ^{-1}(Φ($w$)) = {$w$},

and refer to these strings as uniquely decodable, meaning that there is exactly one way to reconstruct them from their bigram snippets.

We also follow the standard conventions for sets, languages, regular expressions, and automata [2,4,5]. As such, a factor of a string (colloquially a snippet) is any of its contiguous substrings. The term Σ* denotes the free monoid over the alphabet Σ, and, for S ⊆ Σ, the term S* has the usual regular-expression interpretation; the language defined by a regular expression R will be denoted L(R). In addition, we will denote the omission of a symbol from the alphabet by Σ_x := Σ \ {x} for x ∈ Σ.
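For very short strings, membership in L_UNIQ can be checked directly from the definition by enumerating every string with the same bigram counts. The sketch below is a brute-force illustration with exponential worst-case running time, not the automaton-based method of the paper; the names `decodings` and `is_uniquely_decodable` are hypothetical.

```python
from collections import Counter


def decodings(w, delimiter="$"):
    """Return the set of all strings whose delimited bigram counts match w's."""
    x = delimiter + w + delimiter
    remaining = Counter(x[i:i + 2] for i in range(len(x) - 1))
    total = len(x) - 1  # number of bigrams that must be consumed
    found = set()

    def extend(prefix):
        if len(prefix) - 1 == total:
            # Every bigram has been used; strip the two delimiters.
            found.add(prefix[1:-1])
            return
        last = prefix[-1]
        for bigram in list(remaining):
            if remaining[bigram] > 0 and bigram[0] == last:
                remaining[bigram] -= 1
                extend(prefix + bigram[1])
                remaining[bigram] += 1  # backtrack

    extend(delimiter)
    return found


def is_uniquely_decodable(w):
    """Brute-force test of whether w is the only string with its bigram counts."""
    return decodings(w) == {w}


print(sorted(decodings("katana")))      # ['kanata', 'katana']
print(is_uniquely_decodable("banana"))  # True
```

The search amounts to enumerating trails that use every bigram exactly once, starting from the delimiter, which is why it is only practical for short inputs.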

Finally, we shall use the standard five-tuple notation [4] (Σ, Q, q_0, δ, F) to specify a given DFA, where Σ is the input alphabet, Q is the set of states, q_0 is the initial state, δ is the transition function, and F is the set of final states; an analogous notation is used for NFAs. We use the notation |·| to denote both the size of an automaton (measured by the number of states) and the length of a string.
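As a concrete reading of the five-tuple notation, the following sketch stores each component as a field of a small Python class. The class name `DFA`, its field names, and the toy automaton (strings over {a, b} with an even number of a's) are assumptions made for this example and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    alphabet: frozenset  # Σ, the input alphabet
    states: frozenset    # Q, the set of states
    start: str           # q_0, the initial state
    delta: dict          # δ, stored as a dict: (state, symbol) -> state
    finals: frozenset    # F, the set of final states

    def accepts(self, w):
        """Run the DFA on w and report whether it halts in a final state."""
        q = self.start
        for c in w:
            q = self.delta[(q, c)]
        return q in self.finals

    def size(self):
        """Size of the automaton, measured by its number of states."""
        return len(self.states)


# Toy DFA (illustrative only): even number of 'a' symbols.
even_a = DFA(
    alphabet=frozenset("ab"),
    states=frozenset({"even", "odd"}),
    start="even",
    delta={
        ("even", "a"): "odd",
        ("even", "b"): "even",
        ("odd", "a"): "even",
        ("odd", "b"): "odd",
    },
    finals=frozenset({"even"}),
)

print(even_a.accepts("abab"))  # True
print(even_a.size())           # 2
```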

Our starting point is the observation, also made in [11], that L_UNIQ is a factorial language, meaning that it is closed under taking factors. From here, Lia and Xie [11] proceed to characterize L_UNIQ in terms of its minimal whose elements will be called obstructions. The language of all obstructions will be denoted

The DFA recognizing a typical K_{x,a,b} is illustrated in Figure 1. One can verify that these DFAs indeed recognize K_{x,a,b} straightforwardly for Σ = {a, b, x}, and note that the automata continue to be correct for any Σ′ ⊇ {a, b, x}. An important feature of K_{x,a,b} is that 9 states always suffice for its DFA, regardless of Σ (one can also check that the DFAs given in Figure 1 are canonical by applying the DFA minim

This content is AI-processed based on open access ArXiv data.
