On the d-complexity of strings
This paper deals with the complexity of strings, which play an important role in biology (nucleotide sequences), information theory and computer science. The d-complexity of a string is defined as the number of its distinct d-substrings given in Definition 1. The case d=1 is studied in detail.
Research Summary
The paper introduces a novel metric for quantifying the structural richness of finite strings, called d-complexity. A d-substring of a string S of length n is defined as any subsequence of characters whose indices i_1 < i_2 < … < i_k satisfy the distance constraint |i_{j+1} − i_j| ≤ d for every adjacent pair. The set of all distinct d-substrings is denoted D_d(S), and the d-complexity C_d(S) is simply the cardinality |D_d(S)|. This definition generalizes the classic notion of contiguous substrings (the case d = 1) by allowing limited gaps, thereby bridging the gap between strictly local and fully non-local analyses.
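The definition above can be made concrete with a small brute-force enumeration. This is a minimal sketch rather than anything taken from the paper: the function names `d_substrings` and `d_complexity` are hypothetical, and the sketch assumes that only non-empty d-substrings are counted. Its running time can grow exponentially, so it is suitable only for short strings.

```python
def d_substrings(s: str, d: int) -> set[str]:
    """Return the set of distinct non-empty d-substrings of s (brute force)."""
    n = len(s)
    found: set[str] = set()

    def extend(last: int, current: str) -> None:
        # 'current' is a d-substring whose final character sits at index 'last'.
        found.add(current)
        # The next chosen index may lie at most d positions after 'last'.
        for nxt in range(last + 1, min(n, last + d + 1)):
            extend(nxt, current + s[nxt])

    for start in range(n):
        extend(start, s[start])
    return found


def d_complexity(s: str, d: int) -> int:
    """Number of distinct non-empty d-substrings of s (assumed convention)."""
    return len(d_substrings(s, d))


if __name__ == "__main__":
    print(sorted(d_substrings("abc", 1)))  # contiguous substrings only
    print(sorted(d_substrings("abc", 2)))  # additionally contains "ac"
```

For "abc", moving from d = 1 to d = 2 adds exactly one new string, "ac", because indices 0 and 2 are now allowed to be adjacent in a selection.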
The authors first establish basic combinatorial bounds. They prove that C_d(S) ≤ n·σ^d, where σ is the alphabet size, showing that the number of admissible d-substrings grows at most linearly with the string length and exponentially with the gap parameter d. Importantly, for any fixed d the exact value of C_d(S) can be computed in polynomial time. To this end, they design an O(n·d·σ) algorithm that combines a sliding-window scan with a trie (prefix tree) that stores each encountered d-substring. Because each window can generate at most σ^d distinct extensions, the algorithm inserts each candidate once, guaranteeing linear-time behavior in n for constant d and σ.
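The trie idea can be illustrated with a short sketch. It is not the authors' algorithm and makes no attempt to reach the running time quoted above: it visits one implicit trie node per distinct d-substring, so its cost is proportional to C_d(S) times the work per node. The function name `count_d_substrings` is hypothetical, and the sketch assumes that only non-empty d-substrings are counted. It relies on the observation that every prefix of a d-substring is itself a d-substring, so the number of non-root trie nodes equals C_d(S).

```python
def count_d_substrings(s: str, d: int) -> int:
    """Count distinct non-empty d-substrings by walking an implicit trie.

    Each trie node is represented only by the set of positions at which
    some occurrence of its string can end; the trie is never materialized.
    """
    n = len(s)

    # Children of the root: one node per distinct character, holding
    # every position where that character occurs.
    roots: dict[str, set[int]] = {}
    for i, ch in enumerate(s):
        roots.setdefault(ch, set()).add(i)

    count = 0
    stack = list(roots.values())
    while stack:
        ends = stack.pop()
        count += 1  # one node corresponds to one distinct d-substring
        # Slide a window of width d to the right of every end position
        # and group the reachable positions by character.
        children: dict[str, set[int]] = {}
        for p in ends:
            for q in range(p + 1, min(n, p + d + 1)):
                children.setdefault(s[q], set()).add(q)
        stack.extend(children.values())
    return count


if __name__ == "__main__":
    for d in (1, 2, 3):
        print(d, count_d_substrings("abcab", d))
```

Grouping end positions by character is what ensures each distinct string is counted once, even when it occurs at many different index selections.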
The bulk of the paper is devoted to the special case d = 1, i.e., ordinary contiguous substrings. Here the authors derive a precise formula linking C_1(S) to the suffix array and the longest-common-prefix (LCP) array:
C_1(S) = n(n + 1)/2 − ∑_{i=1}^{n−1} LCP[i]
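The link between C_1(S), the suffix array, and the LCP array can be checked numerically on small inputs. The sketch below assumes the standard reading of this relation, in which the sum runs over the LCP values of lexicographically adjacent suffixes. The function name `distinct_substrings` is hypothetical, and the suffix array is built naively by sorting suffixes, which is far slower than dedicated constructions but keeps the example short.

```python
def distinct_substrings(s: str) -> int:
    """Count distinct non-empty substrings of s as n(n+1)/2 minus the LCP sum."""
    n = len(s)

    # Naive suffix array: suffix start positions sorted by suffix text.
    sa = sorted(range(n), key=lambda i: s[i:])

    def lcp(i: int, j: int) -> int:
        # Length of the longest common prefix of the suffixes at i and j.
        k = 0
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        return k

    # One LCP value per pair of suffixes adjacent in sorted order.
    lcp_sum = sum(lcp(sa[r - 1], sa[r]) for r in range(1, n))
    return n * (n + 1) // 2 - lcp_sum


if __name__ == "__main__":
    print(distinct_substrings("banana"))  # 21 - (1 + 3 + 0 + 0 + 2) = 15
```

The intuition is that the suffix at sorted rank r contributes one new substring per prefix, except for the prefixes it shares with the suffix just before it, which have already been counted.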