The Inverse Lyndon Array: Definition, Properties, and Linear-Time Construction

The Lyndon array stores, at each position of a word, the length of the longest maximal Lyndon subword starting at that position, and plays an important role in combinatorics on words, for example in the construction of fundamental data structures suc…

Authors: Pietro Negri, Manuel Sica, Rocco Zaccagnino, Rosalba Zizza

Pietro Negri, Manuel Sica, Rocco Zaccagnino, and Rosalba Zizza

Department of Computer Science, University of Salerno, Fisciano (SA), Italy
{masica, rzaccagnino, rzizza}@unisa.it, p.negri137@gmail.com

Abstract. The Lyndon array stores, at each position of a word, the length of the longest Lyndon subword starting at that position, and plays an important role in combinatorics on words, for example in the construction of fundamental data structures such as the suffix array. In this paper, we introduce the Inverse Lyndon Array, the analogous structure for inverse Lyndon words, namely words that are lexicographically greater than all their proper suffixes. Unlike standard Lyndon words, inverse Lyndon words may have non-trivial borders, which introduces a genuine theoretical difficulty. We show that the inverse Lyndon array can be characterized in terms of the next greater suffix array together with a border-correction term, and prove that this correction coincides with a longest common extension (LCE) value. Building on this characterization, we adapt the nearest-suffix framework underlying Ellert's linear-time construction of the Lyndon array to the inverse setting, obtaining an $O(n)$-time algorithm for general ordered alphabets. Finally, we discuss implications for suffix comparison and report experiments on random, structured, and real datasets showing that the inverse construction exhibits the same practical linear-time behavior as the standard one.

Keywords: Inverse Lyndon words · Lyndon array · Combinatorics on words · Word algorithms · Suffix array

1 Introduction

Lyndon words, introduced by Lyndon [1] and extensively studied in combinatorics on words, are primitive words that are lexicographically minimal among their rotations.
Equivalently, a non-empty word $w$ is a Lyndon word if $w \prec s$ for every non-empty proper suffix $s$ of $w$, where $\prec$ denotes the lexicographic order induced by a total order on the alphabet $\Sigma$. The Lyndon array $\lambda[1..n]$ of a word $x[1..n]$ gives, at each position $i$, the length of the longest Lyndon factor of $x$ starting at $i$. Since Bannai et al. [8] showed that $\lambda$ is sufficient to compute all maximal repetitions in linear time, efficient construction algorithms for $\lambda$ have received considerable attention. Franek et al. [6, 7] related $\lambda$ to a next-smaller-value computation on the inverse suffix array, while Ellert [13] later gave the first direct linear-time algorithm on general ordered alphabets, based on nearest smaller suffix edges and an LCE acceleration mechanism inspired by Manacher's algorithm [4]. Subsequent work considered space-efficient variants, sublinear algorithms, and links with the Lyndon forest [14, 16, 17].

A closely related notion is that of inverse Lyndon words: a non-empty word $w$ is inverse Lyndon if $s \prec w$ for every non-empty proper suffix $s$ of $w$ [11]. Inverse Lyndon words were introduced in connection with the Inverse Canonical Lyndon Factorization (ICFL), a variant of CFL that may yield more balanced factors and is therefore well suited to bioinformatics applications [18]. A border property for the canonical inverse Lyndon factorization was introduced in [26] and later clarified in [12].

From a combinatorial viewpoint, the inverse setting is not a merely formal dualization of the standard one. The crucial asymmetry is that standard Lyndon words are unbordered, whereas inverse Lyndon words may have non-trivial borders. As a consequence, the classical identity $\lambda[i] = \mathit{next}[i] - i$ no longer survives unchanged.
The main difficulty is therefore not merely to restate Ellert's framework with the inequalities reversed, but rather to determine precisely how borders interact with nearest greater suffixes and to show that this interaction preserves the amortized linearity of the underlying LCE machinery. The "border correction" is structural: without identifying it as an LCE value on a nearest-greater-suffix edge, one would need a separate border computation, breaking the direct linear-time reduction.

Our result also connects two nearby lines of work, namely non-crossing LCE queries on general ordered alphabets [24, 25] and the border-based view of inverse Lyndon words and ICFL [12]. In the inverse Lyndon array setting, we show that the needed border correction is recovered directly from the LCE on the corresponding next greater suffix (NGS) edge.

Contributions. We introduce the Inverse Lyndon Array $\lambda^{-1}[1..n]$, which records at each position $i$ the length of the maximal inverse Lyndon subword starting at $i$.
The main contributions of this work are as follows:

– an exact characterization of $\lambda^{-1}$ through the next greater suffix array, with a border correction term (see Lemma 7);
– the identification of the border correction with an LCE value on a nearest-greater-suffix edge, which removes the need for any explicit border computation and distinguishes our approach from border-explicit treatments in the inverse Lyndon literature (see Lemma 8);
– the adaptation of the non-crossing property, chain iteration, and LCE acceleration from the classical NSS (Next Smaller Suffix)/PSS (Previous Smaller Suffix) setting to the novel NGS (Next Greater Suffix)/PGS (Previous Greater Suffix) setting, yielding a linear-time algorithm for constructing the inverse Lyndon array over general ordered alphabets;
– a study of suffix-sorting implications, showing that the compatibility property underlying the standard Lyndon array has no direct inverse analogue, while the joint use of $\lambda$ and $\lambda^{-1}$ still yields constant-time rules for several suffix comparisons.

The implementation and benchmark scripts are available online at https://github.com/FLaTNNBio/inverselyndonarray.

2 Preliminaries

Words and lexicographic orders. A word $x = x[1]x[2] \cdots x[n]$ is a sequence of symbols from a totally ordered alphabet $(\Sigma, <)$. We write $|x| = n$ for its length, $\varepsilon$ for the empty word, and $\Sigma^*$ and $\Sigma^+$ for the sets of all words and all non-empty words, respectively. For $1 \le i \le j \le n$, the subword $x[i..j]$ is the sequence $x[i]x[i+1] \cdots x[j]$; if $i > j$, it equals $\varepsilon$. A subword $x[1..j]$ is a prefix, and $x_i = x[i..n]$ denotes the suffix starting at position $i$. The lexicographic order $\prec$ on $\Sigma^*$ is defined as follows: $x \prec y$ if and only if either $y = xw$ for some non-empty word $w$, or $x = urv$ and $y = usw$ for words $u, v, w \in \Sigma^*$ and symbols $r, s \in \Sigma$ with $r < s$.
The Longest Common Extension of two suffixes is
$\mathrm{lce}(i, j) = \max\{ |u| \mid x_i = uv,\ x_j = uw \}$.

We frame every word with sentinels, i.e., $x = \#\,x(1..n)\,\$$, where $\#$ and $\$$ satisfy $\# < \$ < a$ for all $a \in \Sigma$ in the standard setting, and $\# > \$ > a$ for all $a \in \Sigma$ in the inverse setting. In the standard case, placing $\#$ strictly below every alphabet symbol ensures that $x_1$ is the lexicographic minimum among all suffixes of the framed word, so all suffix ranks are distinct and every NSS/PSS edge is unambiguous. In the inverse case, the reversed order makes $x_1$ the lexicographic maximum, and guarantees a strict total order on all suffixes with no ties.

A non-empty word $w$ is a Lyndon word if $w \prec s$ for every non-empty proper suffix $s$ of $w$ [2]. A subword $x[i..j]$ is a maximal Lyndon subword of $x$ starting at $i$ if it is a Lyndon word and either $j = n$ or $x[i..j+1]$ is not Lyndon.

Definition 1 (Lyndon Array). The Lyndon array $\lambda[1..n]$ of a word $x[1..n]$ is defined by
$\lambda[i] = \max\{ m \in [1, n-i+1] \mid x[i..i+m-1] \text{ is a Lyndon word} \}$.

The suffix array $\mathrm{SA}[1..n]$ is the permutation such that $x_{\mathrm{SA}[1]} \prec x_{\mathrm{SA}[2]} \prec \cdots \prec x_{\mathrm{SA}[n]}$. The inverse suffix array $\mathrm{ISA}$ satisfies $\mathrm{ISA}[\mathrm{SA}[i]] = i$, so $\mathrm{ISA}[j]$ is the lexicographic rank of $x_j$. Franek et al. [6] proved the following characterization.

Theorem 1 ([6]). $x[i..j]$ is a maximal Lyndon subword if and only if:
1. $\mathrm{ISA}[i] < \mathrm{ISA}[h]$ for all $h \in [i+1, j]$, and
2. either $j = n$ or $\mathrm{ISA}[j+1] < \mathrm{ISA}[i]$.

Consequently, $\lambda$ can be obtained from an NSV (next smaller value) computation on the inverse suffix array, as follows. Given an array $A[1..n]$, the NSV array is defined by
$\mathrm{NSV}[i] = \min\{ j > i \mid A[j] < A[i] \} - i$,
with $\mathrm{NSV}[i] = n - i + 1$ if no such $j$ exists.

Fig. 1. NSS and PSS edges for $x = \#banana\$$. Dashed arcs denote next smaller suffix edges, while solid arcs denote previous smaller suffix edges.
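The NSV route to $\lambda$ can be illustrated on the framed word of Fig. 1. The following Python sketch is only an illustration under stated assumptions: the function names and the `order` parameter (the alphabet listed from smallest to largest symbol) are hypothetical, suffixes are sorted naively rather than in linear time, and positions are 0-based while the returned lengths match the 1-based definitions above.

```python
def lyndon_array_via_nsv(x, order):
    """Lyndon array of x via next-smaller-value on the inverse suffix array.

    `order` lists the alphabet from smallest to largest symbol (an assumption
    of this sketch). Positions are 0-based; the lengths are unchanged.
    """
    n = len(x)
    rank = {c: r for r, c in enumerate(order)}
    key = [rank[c] for c in x]
    # Naive suffix sorting: illustration only, not linear time.
    sa = sorted(range(n), key=lambda i: key[i:])
    isa = [0] * n
    for r, i in enumerate(sa):
        isa[i] = r
    lam = [0] * n
    stack = []  # positions still waiting for a smaller ISA value to the right
    for j in range(n):
        while stack and isa[j] < isa[stack[-1]]:
            i = stack.pop()
            lam[i] = j - i  # NSV[i] = j - i
        stack.append(j)
    for i in stack:  # no smaller value to the right
        lam[i] = n - i
    return lam


def lyndon_array_brute(x, order):
    """Definition 1 applied literally (cubic time), used as a cross-check."""
    rank = {c: r for r, c in enumerate(order)}
    key = [rank[c] for c in x]
    n = len(key)

    def is_lyndon(w):
        # Lists compare lexicographically, and a proper prefix is smaller.
        return all(w < w[k:] for k in range(1, len(w)))

    return [max(m for m in range(1, n - i + 1) if is_lyndon(key[i:i + m]))
            for i in range(n)]


print(lyndon_array_via_nsv("#banana$", "#$abn"))  # [8, 1, 2, 1, 2, 1, 1, 1]
```

The stack keeps exactly the positions whose next smaller rank has not been seen yet, so each position is pushed and popped once after the suffix ranks are known.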
2.1 The NSS/PSS Framework

We briefly recall the framework of Ellert [13] for computing $\lambda$ in $O(n)$ time, since it is the template for our inverse construction. Throughout this section we assume $x = \#\,x(1..n)\,\$$ with $\# < \$ < a$ for all $a \in \Sigma$.

Definition 2 (NSS and PSS Arrays). For a word $x[1..n]$, define
$\mathit{next}[i] = \min\{ j \in (i, n] \mid x_j \prec x_i \}$, $\mathit{prev}[i] = \max\{ j \in [1, i) \mid x_j \prec x_i \}$,
with $\mathit{next}[1] = \mathit{next}[n] = n+1$, $\mathit{prev}[1] = 0$, and $\mathit{prev}[n] = 1$.

Example 1. For $x = \#banana\$$, with $\# < \$ < a < b < n$, the arrays are:

    i        1  2  3  4  5  6  7  8
    x        #  b  a  n  a  n  a  $
    next[i]  9  3  5  5  7  7  8  9
    prev[i]  0  1  1  3  1  5  1  1
    λ[i]     8  1  2  1  2  1  1  1

Figure 1 shows the corresponding NSS and PSS edges.

Lemma 1 ([13]). For all $i \in [1, n]$, $\mathit{next}[i] = i + \lambda[i]$.

Lemma 2 ([13]). For all $i \in [2, n]$, the word $x[\mathit{prev}[i]..i-1]$ is a Lyndon word.

The key properties enabling linear-time construction are the following.

Lemma 3 (Non-Crossing [13]). Let $l_1 < r_1$ and $l_2 < r_2$ be index pairs, each connected by an NSS or PSS edge. Then it is impossible to have $l_1 < l_2 < r_1 < r_2$.

Lemma 4 (Chain Iteration [13]). For any $r \in [2, n]$:
1. $\mathit{prev}[r] = \mathit{prev}^{*}[r-1]$,
2. $\mathit{next}[l] = r$ if and only if $l = \mathit{prev}^{*}[r-1]$ and $l > \mathit{prev}[r]$,
where $\mathit{prev}^{*}[r-1]$ denotes an element on the PSS chain from $r-1$.

Lemma 5 (LCE Acceleration [13]). Let $l = \mathit{prev}[k]$ and $r = \mathit{next}[k]$. Then:
1. if $\mathrm{lce}(l, k) = \mathrm{lce}(k, r)$, then $\mathrm{lce}(l, r) \ge \mathrm{lce}(k, r)$;
2. if $\mathrm{lce}(l, k) < \mathrm{lce}(k, r)$, then $\mathrm{lce}(l, r) = \mathrm{lce}(l, k)$ and $\mathit{prev}[r] = l$;
3. if $\mathrm{lce}(l, k) > \mathrm{lce}(k, r)$, then $\mathrm{lce}(l, r) = \mathrm{lce}(k, r)$ and $\mathit{next}[l] = r$.
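The arrays of Example 1 and the non-crossing property of Lemma 3 can be checked by brute force. This is a quadratic sketch for illustration, not Ellert's algorithm; the names `nss_pss` and `has_crossing` and the `order` parameter are hypothetical.

```python
def nss_pss(x, order):
    """1-based next/prev arrays of Definition 2, by brute force."""
    n = len(x)
    rank = {c: r for r, c in enumerate(order)}
    # suf[i] is the suffix x_i as a list of symbol ranks (index 0 unused).
    suf = [None] + [[rank[c] for c in x[i:]] for i in range(n)]
    nxt, prv = [], []
    for i in range(1, n + 1):
        # First smaller suffix to the right, or n + 1 if there is none.
        nxt.append(next((j for j in range(i + 1, n + 1) if suf[j] < suf[i]), n + 1))
        # Nearest smaller suffix to the left, or 0 if there is none.
        prv.append(next((j for j in range(i - 1, 0, -1) if suf[j] < suf[i]), 0))
    return nxt, prv


def has_crossing(nxt, prv):
    """True if two NSS/PSS edges satisfy l1 < l2 < r1 < r2."""
    n = len(nxt)
    edges = {(i, nxt[i - 1]) for i in range(1, n + 1) if nxt[i - 1] <= n}
    edges |= {(prv[i - 1], i) for i in range(1, n + 1) if prv[i - 1] >= 1}
    return any(l1 < l2 < r1 < r2 for (l1, r1) in edges for (l2, r2) in edges)


nxt, prv = nss_pss("#banana$", "#$abn")
lam = [nxt[i] - (i + 1) for i in range(len(nxt))]  # Lemma 1: next[i] = i + λ[i]
print(nxt, prv, lam, has_crossing(nxt, prv))
```

On the framed word `#banana$` the sketch reproduces the three rows of the table above and reports no crossing edges.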
Together with the SMART-LCE procedure of [13], which computes all required LCE values in amortized $O(n)$ time via a global rightmost-inspected-position variable, these lemmas yield an algorithm that computes $\lambda$ in $O(n)$ time and $O(n)$ space on general ordered alphabets.

3 The Inverse Lyndon Array

We now develop the theory of the inverse Lyndon array. Throughout this section, $x = \#\,x(1..n)\,\$$, with sentinels satisfying $\# > \$ > a$ for all $a \in \Sigma$.

3.1 Inverse Lyndon Words

Definition 3 ([11]). A word $w \in \Sigma^+$ is an inverse Lyndon word if $s \prec w$ for every non-empty proper suffix $s$ of $w$.

Unlike standard Lyndon words, inverse Lyndon words may have non-trivial borders. For example, $dabda$ is an inverse Lyndon word with maximal border $da$. The following prefix-closure property is fundamental [11].

Lemma 6 ([11]). Every non-empty prefix of an inverse Lyndon word is an inverse Lyndon word.

Definition 4 (Maximal Inverse Lyndon Subword). A subword $x[i..j]$ is a maximal inverse Lyndon subword starting at $i$ if it is an inverse Lyndon word and either $j = n$ or $x[i..j+1]$ is not an inverse Lyndon word.

Definition 5 (Inverse Lyndon Array). The Inverse Lyndon Array is defined by
$\lambda^{-1}[i] = \max\{ m \in [1, n-i+1] \mid x[i..i+m-1] \text{ is an inverse Lyndon word} \}$.

Example 2. Let $x = \#aababbaa\$$, with $\# > \$ > b > a$. The inverse Lyndon array and the associated NGS/PGS arrays are:

    i        1   2  3  4  5  6   7   8  9   10
    x        #   a  a  b  a  b   b   a  a   $
    λ⁻¹      10  2  1  3  1  4   3   2  1   1
    next⁻¹   11  3  4  6  6  10  10  9  10  11
    prev⁻¹   0   1  1  1  4  1   6   7  7   1
    nlce     0   1  0  1  0  0   0   1  0   0

At $i = 4$, the maximal inverse Lyndon subword is $x[4..6] = bab$, so $\lambda^{-1}[4] = 3$. Its longest border has length 1, hence $\mathit{next}^{-1}[4] = 6$ and $\mathit{nlce}[4] = 1$, in agreement with Corollary 1. At $i = 6$, the maximal inverse Lyndon subword is $x[6..9] = bbaa$, so $\lambda^{-1}[6] = 4$. Since this factor is unbordered, $\mathit{next}^{-1}[6] = 10$ and $\mathit{nlce}[6] = 0$.
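The rows of Example 2 can be reproduced directly from Definitions 3 and 5. The sketch below is a brute-force check under stated assumptions, not the linear-time construction: `inverse_lyndon_tables` and its `order` parameter (alphabet from smallest to largest, with the inverse sentinels last) are hypothetical names, the greedy extension relies on the prefix closure of Lemma 6, and `nlce` is computed as the LCE between position $i$ and its next greater suffix.

```python
def inverse_lyndon_tables(x, order):
    """Brute-force λ⁻¹, next⁻¹ (1-based values) and nlce for a framed word."""
    n = len(x)
    rank = {c: r for r, c in enumerate(order)}
    key = [rank[c] for c in x]

    def is_inverse_lyndon(w):
        # Definition 3: w is greater than each non-empty proper suffix.
        return all(w > w[k:] for k in range(1, len(w)))

    def lce(i, j):
        m = 0
        while i + m < n and j + m < n and key[i + m] == key[j + m]:
            m += 1
        return m

    lam_inv, nxt, nlce = [], [], []
    for i in range(n):  # i is 0-based here
        m = 1
        # By prefix closure (Lemma 6), the first failing extension is final.
        while i + m < n and is_inverse_lyndon(key[i:i + m + 1]):
            m += 1
        lam_inv.append(m)
        j = next((j for j in range(i + 1, n) if key[j:] > key[i:]), n)
        nxt.append(j + 1)  # convert to the 1-based convention of the text
        nlce.append(lce(i, j) if j < n else 0)
    return lam_inv, nxt, nlce


lam_inv, nxt, nlce = inverse_lyndon_tables("#aababbaa$", "ab$#")
print(lam_inv)  # [10, 2, 1, 3, 1, 4, 3, 2, 1, 1]
print(nxt)      # [11, 3, 4, 6, 6, 10, 10, 9, 10, 11]
print(nlce)     # [0, 1, 0, 1, 0, 0, 0, 1, 0, 0]
```

For every 1-based position $i$ the three rows satisfy $\lambda^{-1}[i] = \mathit{next}^{-1}[i] - i + \mathit{nlce}[i]$, which is the identity the example refers to.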
3.2 NGS/PGS Arrays and their Relationship to $\lambda^{-1}$

Definition 6 (NGS and PGS Arrays). For a word $x[1..n]$ with inverted sentinels, define
$\mathit{next}^{-1}[i] = \min\{ j \in (i, n] \mid x_j \succ x_i \}$, $\mathit{prev}^{-1}[i] = \max\{ j \in [1, i) \mid x_j \succ x_i \}$,
with $\mathit{next}^{-1}[1] = \mathit{next}^{-1}[n] = n+1$, $\mathit{prev}^{-1}[1] = 0$, and $\mathit{prev}^{-1}[n] = 1$.

Thus, in the inverse setting, the relevant edges connect nearest greater suffixes rather than nearest smaller suffixes.

Lemma 7 (NGS and $\lambda^{-1}$ equivalence). Let $z$ be the maximal inverse Lyndon subword starting at $i$, and let $\mathrm{border}(z)$ denote the length of its longest proper border. Then
$\mathit{next}^{-1}[i] = i + \lambda^{-1}[i] - \mathrm{border}(z)$.

Proof. Write $z = bhb$, where $b \in \Sigma^*$ is the border, possibly empty, and $h \in \Sigma^+$. Then $x_i = bhb\alpha$ for some $\alpha \in \Sigma^+$. Let $j = \mathit{next}^{-1}[i]$. We claim that $x_j = b\alpha$, that is, the suffix $x_j$ starts at the second occurrence of the border.

Suppose first that $x_j$ starts before that occurrence, say $x_j = \beta b \alpha$ with $\beta \in \Sigma^+$ and $z = b\gamma\beta b$. Then $\beta b$ is a proper suffix of $z$, so by Definition 3 we have $z \succ \beta b$. Hence $z\alpha \succ \beta b \alpha$, that is, $x_i \succ x_j$, contradicting the choice of $j$.

Suppose instead that $x_j$ starts after the border, say $x_j = \beta\alpha$ with $b = \gamma\beta$ and $\gamma, \beta \in \Sigma^+$. Consider the intermediate suffix $x_{k_m} = \beta h \gamma \beta \alpha$. Since $j = \mathit{next}^{-1}[i]$, we have $x_{k_m} \prec x_i \prec x_j$. However, from $x_j = \beta\alpha \succ \beta h \gamma \beta \alpha = x_{k_m}$, we obtain $b\alpha = \gamma\beta\alpha \succ \gamma\beta h \gamma\beta\alpha = x_i$, which implies $x_{k_m} \succ x_i$, again a contradiction.

Therefore $x_j = b\alpha$, and so $j = i + |z| - |b| = i + \lambda^{-1}[i] - \mathrm{border}(z)$.

The border term is the main structural difference with the standard case. Standard Lyndon words are unbordered, so the corresponding correction term is always zero.

Lemma 8 (Border via LCE). Let $z$ be the maximal inverse Lyndon subword starting at $i$, and let $j = \mathit{next}^{-1}[i]$. Then $\mathrm{lce}(i, j) = \mathrm{border}(z)$.

Proof.
Using the notation above, Lemma 7 yields $x_j = b\alpha$. Since $x_i = bhb\alpha$ and $x_j = b\alpha$, the suffixes $x_i$ and $x_j$ agree on the prefix $b$, so $\mathrm{lce}(i, j) \ge |b|$. For the upper bound, note that $x_j \succ x_i$ since $j = \mathit{next}^{-1}[i]$. The two suffixes share a common prefix of length $|b|$ (both begin with $b$); at the next position, $x_i[|b|+1] = h[1]$ while $x_j[|b|+1] = \alpha[1]$. Since $x_j \succ x_i$ and they agree on the first $|b|$ positions, the comparison is decided at position $|b|+1$, which forces $\alpha[1] > h[1]$. In particular $\alpha[1] \ne h[1]$, so the common prefix ends exactly at length $|b|$ and $\mathrm{lce}(i, j) = |b| = \mathrm{border}(z)$.

Fig. 2. Border correction for the inverse Lyndon array. If the maximal inverse Lyndon word starting at position $i$ has the form $z = bhb$, then the next greater suffix starts at the second occurrence of $b$. Consequently, $\mathrm{lce}(i, j) = |b|$ and $\lambda^{-1}[i] = (j - i) + |b|$.

Corollary 1. For every $i$, $\lambda^{-1}[i] = \mathit{next}^{-1}[i] - i + \mathrm{lce}(i, \mathit{next}^{-1}[i])$.

Remark 1 (Why the border term is structural). The correction term in Corollary 1 is not a post-processing detail. Without Lemma 8, the inverse array would require explicit border information for each maximal inverse Lyndon factor. That would introduce an additional computational layer not present in the standard case. The point of Lemma 8 is precisely that the border is already encoded by the same nearest-suffix LCE information computed by the algorithm. Thus, computing $\mathit{next}^{-1}$ together with the associated $\mathit{nlce}$ array suffices to recover $\lambda^{-1}$.

Lemma 9. For all $i \in [2, n]$, $x[\mathit{prev}^{-1}[i]..i-1]$ is an inverse Lyndon word.

Proof. Let $p = \mathit{prev}^{-1}[i]$ and $w = x[p..i-1]$.
By definition of PGS, $x_p \succ x_i$ and $x_k \prec x_i$ for every $k \in (p, i)$. By transitivity, $x_p \succ x_k$ for every $k \in (p, i)$. Let $s = x[k..i-1]$ be any proper non-empty suffix of $w$, with $k \in (p, i)$. Since $x_p \succ x_k$, let $m \ge 1$ be the first index where $x_p$ and $x_k$ differ; then $x[p+m-1] > x[k+m-1]$. If $m \le i - k = |s|$, the mismatch falls within both $w$ and $s$, so $w \succ s$. If $m > i - k$, then $s = x[k..i-1]$ is a proper prefix of $w$ (they agree on all $|s|$ positions of $s$), so $w \succ s$ by the prefix rule of the lexicographic order. In both cases $w \succ s$, so $w$ is an inverse Lyndon word.

3.3 Combinatorial Properties of NGS/PGS Edges

We now show that the main structural lemmas from the NSS/PSS setting remain valid in the inverse case. These lemmas also fit into the broader context of non-crossing LCE-based techniques on general ordered alphabets. Here, however, the non-crossing structure is induced specifically by NGS/PGS edges in the inverse Lyndon setting.

Lemma 10 (Non-Crossing of NGS/PGS Edges). Let $l_1 < r_1$ and $l_2 < r_2$ be index pairs, each connected by an NGS or PGS edge. Then it is impossible to have $l_1 < l_2 < r_1 < r_2$.

Proof. Assume for contradiction that $l_1 < l_2 < r_1 < r_2$. Since $l_2 \in (l_1, r_1)$ and $l_1, r_1$ are connected by an NGS or PGS edge, every suffix starting in $(l_1, r_1)$ is lexicographically smaller than both endpoints, hence $x_{l_2} \prec x_{r_1}$. On the other hand, since $r_1 \in (l_2, r_2)$ and $l_2, r_2$ are connected, we get $x_{l_2} \succ x_{r_1}$. This contradiction proves the claim.

Lemma 11 (Chain Iteration for NGS/PGS). For any $r \in [2, n]$:
1. $\mathit{prev}^{-1}[r] = (\mathit{prev}^{-1})^{*}[r-1]$;
2. $\mathit{next}^{-1}[l] = r$ if and only if $l = (\mathit{prev}^{-1})^{*}[r-1]$ and $l > \mathit{prev}^{-1}[r]$.

Proof. For the first statement, suppose $\mathit{prev}^{-1}[r] \ne (\mathit{prev}^{-1})^{*}[r-1]$.
Then there exists $r' = (\mathit{prev}^{-1})^{*}[r-1]$ such that $\mathit{prev}^{-1}[r] \in (\mathit{prev}^{-1}[r'], r')$, giving
$\mathit{prev}^{-1}[r'] < \mathit{prev}^{-1}[r] < r' < r$,
which contradicts Lemma 10.

For the second statement, assume first that $\mathit{next}^{-1}[l] = r$. Then all suffixes starting in $(l, r)$ are smaller than $x_r$, so $\mathit{prev}^{-1}[r] \notin [l, r)$ and therefore $\mathit{prev}^{-1}[r] < l$. If $l \ne (\mathit{prev}^{-1})^{*}[r-1]$, then there exists $r' = (\mathit{prev}^{-1})^{*}[r-1]$ such that
$\mathit{prev}^{-1}[r'] < l < r' < r$,
again contradicting Lemma 10. Conversely, assume that $\mathit{prev}^{-1}[r] < l$ and $l = (\mathit{prev}^{-1})^{*}[r-1]$. Since $l \in (\mathit{prev}^{-1}[r], r)$, we have $x_l \prec x_r$. Moreover, the chain property implies $x_k \prec x_l$ for every $k \in (l, r)$. Therefore $r$ is the first position to the right of $l$ with a greater suffix, namely $\mathit{next}^{-1}[l] = r$.

Lemma 12 (LCE Acceleration for NGS/PGS). Let $l = \mathit{prev}^{-1}[k]$ and $r = \mathit{next}^{-1}[k]$. Then:
1. if $\mathrm{lce}(l, k) = \mathrm{lce}(k, r)$, then $\mathrm{lce}(l, r) \ge \mathrm{lce}(k, r)$, and either $\mathit{prev}^{-1}[r] = l$ or $\mathit{next}^{-1}[l] = r$;
2. if $\mathrm{lce}(l, k) < \mathrm{lce}(k, r)$, then $\mathrm{lce}(l, r) = \mathrm{lce}(l, k)$ and $\mathit{prev}^{-1}[r] = l$;
3. if $\mathrm{lce}(l, k) > \mathrm{lce}(k, r)$, then $\mathrm{lce}(l, r) = \mathrm{lce}(k, r)$ and $\mathit{next}^{-1}[l] = r$.

Proof. For the first case, since $l = \mathit{prev}^{-1}[k]$ and $r = \mathit{next}^{-1}[k]$, we have $x_l \succ x_k \succ x_m$ and $x_r \succ x_k \succ x_m$ for all $m \in (l, r) \setminus \{k\}$. Since $\mathrm{lce}(l, k) = \mathrm{lce}(k, r)$, it follows that $\mathrm{lce}(l, r) \ge \mathrm{lce}(k, r)$. If $x_l \succ x_r$, then $x_l \succ x_r \succ x_m$ for all such $m$, so $\mathit{prev}^{-1}[r] = l$. By symmetry, if $x_r \succ x_l$, then $\mathit{next}^{-1}[l] = r$.

For the second case, let $c = \mathrm{lce}(l, k)$. Since $x_l \succ x_k$, we have $x[l+c] > x[k+c]$. Because $\mathrm{lce}(l, k) < \mathrm{lce}(k, r)$, the suffix $x_r$ shares a prefix of length at least $c+1$ with $x_k$, hence $x[r+c] = x[k+c]$.
Thus $x_l$ and $x_r$ agree for $c$ positions and differ at the next one with $x[l+c] > x[r+c]$, so $x_l \succ x_r$ and $\mathrm{lce}(l, r) = c$. Since $x_r \succ x_m$ for all $m \in (l, r)$, it follows that $\mathit{prev}^{-1}[r] = l$. The third case is symmetric, yielding $x_r \succ x_l$, $\mathrm{lce}(l, r) = \mathrm{lce}(k, r)$, and $\mathit{next}^{-1}[l] = r$.

3.4 The LCE-NGS Algorithm

Lemmas 11 and 12 yield Algorithm 1, which mirrors the structure of the standard LCE-NSS algorithm [13]. The key difference is the direction of the comparison in the while-loop condition: in the inverse setting we test whether $x[\ell+m] < x[r+m]$, because we are looking for the first suffix to the right that is lexicographically greater. The procedure Smart-Lce is the same amortized primitive used in the standard setting [13], but here it is queried only on NGS/PGS candidates.

Algorithm 1: LCE-NGS for the inverse Lyndon array
Input: word x[1..n] in inverse sentinel mode
Output: next⁻¹, prev⁻¹, nlce, plce, λ⁻¹

    prev⁻¹[1] ← 0; next⁻¹[n] ← n + 1; plce[1] ← 0
    for r ← 2 to n do
        ℓ ← r − 1; m ← Smart-Lce(ℓ, r, 0)
        while x[ℓ + m] < x[r + m] do
            next⁻¹[ℓ] ← r; nlce[ℓ] ← m
            if m = plce[ℓ] then
                m ← Smart-Lce(prev⁻¹[ℓ], r, m)
            else if m > plce[ℓ] then
                m ← plce[ℓ]
            end
            ℓ ← prev⁻¹[ℓ]
        end
        prev⁻¹[r] ← ℓ; plce[r] ← m
    end
    for i ← 1 to n do
        λ⁻¹[i] ← next⁻¹[i] − i + nlce[i]
    end

For self-containment, we recall the two invariants that are used in the proof of Theorem 2. First, after each explicit scan, $\mathit{rhs}$ is the rightmost text position inspected so far. Second, whenever an edge value has already been fixed, the corresponding stored quantity in $\mathit{nlce}$ or $\mathit{plce}$ is exactly the required LCE for that edge.
Hence, if a query $(\ell, r, m)$ satisfies $r + m < \mathit{rhs}$, then the answer lies entirely inside a previously certified window and can be returned in $O(1)$ time from the stored edge information selected by Lemma 12. Otherwise, Smart-Lce extends explicitly, increases $\mathit{rhs}$, and stores the resulting LCE value. This is the only place where new character comparisons are performed.

Theorem 2. Algorithm 1 computes the inverse Lyndon array $\lambda^{-1}[1..n]$ in $O(n)$ time and $O(n)$ space on general ordered alphabets.

Proof. Correctness follows from Lemmas 7, 8, 10, 11, and 12. The additional point to justify is that the border correction does not create extra rescanning. By Lemma 8, whenever $\mathit{next}^{-1}[\ell] = r$ is fixed, the same value $\mathit{nlce}[\ell] = \mathrm{lce}(\ell, r)$ is both the edge LCE needed by the algorithm and the border correction needed later for $\lambda^{-1}[\ell]$. Therefore the algorithm never computes borders explicitly and never performs a separate pass for border information.

Moreover, if Smart-Lce is called with $r + m < \mathit{rhs}$, then the query lies entirely inside a region already certified by a previous explicit scan. In that case, Lemma 12 and the stored $\mathit{nlce}$ or $\mathit{plce}$ values suffice to answer the query without rescanning characters inside that window. Thus borders cannot trigger hidden rescanning there. Every explicit character comparison strictly advances the global frontier $\mathit{rhs}$, which is monotone and never exceeds $n$. Hence the total number of explicit comparisons is $O(n)$.

The outer loop performs $n - 1$ iterations. Each index receives its final $\mathit{next}^{-1}$ value once, each stored LCE value is written once, and the final recovery formula $\lambda^{-1}[i] = \mathit{next}^{-1}[i] - i + \mathit{nlce}[i]$ is applied once per position. Therefore the total running time is $O(n)$ and the space usage is $O(n)$.

4 Experimental Evaluation

We implemented LCE-NSS for $\lambda$ and LCE-NGS for $\lambda^{-1}$ in C++17. The timing runs reported here were produced by the benchmark pipeline with g++ -std=c++17 -O2. We measured wall-clock construction time and, when relevant, report the recovery pass separately. All experiments were run on a Dell Precision 7960 Tower under Ubuntu Linux, with an Intel Xeon w5-3425 CPU and 128 GB of ECC RAM. Correctness was checked against brute force on small and medium inputs and on crafted edge cases.

We used random words over alphabets of size $\sigma \in \{2, 4, 26\}$, structured synthetic families, and real texts from Pizza&Chili and the Large Canterbury Corpus [21, 22, 23]. The profiling-enabled implementation also counted explicit character comparisons, LCE reuse hits inside the certified Smart-Lce window, and explicit extension calls.

Table 1 reports the random-input timings. On instances with $n \ge 5 \times 10^4$, the mean ratio LCE-NGS / LCE-NSS is 1.0060, the median ratio is 0.9990, and the observed range is [0.9882, 1.0939], which shows that the inverse construction matches the standard one closely across all alphabet sizes. Table 2 reports the results obtained on real files. Across the five corpora, the mean ratio is 1.0019, the median ratio is 1.0024, and the observed range is [0.9665, 1.0325].

The structured synthetic families show the same practical behavior. Over all structured instances with $n \ge 5 \times 10^4$, the mean ratio is 0.9852, the median ratio is 0.9981, and the observed range is [0.8848, 1.0874]. Thus the inverse kernel remains close to the standard one across random, structured, and real inputs.

To stress the border-correction path, we additionally profiled long-border families while recording explicit character comparisons, LCE reuse hits, and explicit extension calls inside Smart-Lce. Each instance contains a repeated prefix-suffix border occupying either 25% or 40% of the full length.
The timing ratios on these border-heavy inputs remain close to one, with mean 1.0013, median 1.0031, and range [0.9680, 1.0257]. Table 3 shows that all counters remain linear in $n$. In particular, from $10^5$ to $10^6$, both explicit comparison counts and explicit extension counts grow by about $10\times$ in both families, which matches Theorem 2. Deep borders do affect constants, but they do not trigger superlinear rescanning.

Table 1. Core construction time (µs) on random words.

                 σ = 2                σ = 4                σ = 26
    n            LCE-NSS   LCE-NGS    LCE-NSS   LCE-NGS    LCE-NSS   LCE-NGS
    10^3         15.4      15.0       15.8      15.4       15.6      16.0
    5 × 10^3     79.6      84.2       95.6      101.2      94.0      92.6
    10^4         171.4     179.0      205.4     207.0      194.0     190.6
    5 × 10^4     948.2     1037.2     1062.0    1059.2     1006.6    1000.6
    10^5         1974.2    1973.4     2187.0    2162.4     2060.6    2053.8
    5 × 10^5     9197.8    9416.2     10503.2   10543.2    10087.8   9981.4
    10^6         18060.0   18320.2    20972.4   20955.0    20121.8   19919.2
    2 × 10^6     35778.4   36154.0    41893.6   41840.6    40219.0   39744.4
    5 × 10^6     89178.6   90033.6    102314.0  104479.2   100569.0  99436.6

Table 2. Core construction time (µs) on real corpora.

    File           n         LCE-NSS    LCE-NGS    Ratio
    english.txt    5000000   98850.6    98846.6    1.0000
    dna.txt        5000000   104074.2   104328.6   1.0024
    bible.txt      4047392   75372.2    77819.2    1.0325
    e.coli         4638690   96704.0    93460.0    0.9665
    world192.txt   2473400   46763.6    47144.6    1.0081

Table 3. Border-heavy profiling counters. Values are averages over three runs.

    Family       n         NSS cmp.  NGS cmp.  NSS reuse  NGS reuse  NSS ext.  NGS ext.
    25% border   100000    213984    213427    29347      29258      148890    148233
    40% border   100000    200510    208599    44719      35317      131483    141882
    25% border   500000    945829    1039872   294036     179692     582948    705698
    40% border   500000    947176    863646    292529     393730     584150    475202
    25% border   1000000   2134164   2134660   293205     293234     1481797   1483100
    40% border   1000000   2135385   2123841   292240     306666     1483741   1468012

5 Conclusions and Suffix-Sorting Perspectives

We introduced the inverse Lyndon array and characterized it in terms of the next greater suffix array and a border correction term. We also showed that the combinatorial ingredients behind the NSS/PSS framework transfer to the NGS/PGS setting, yielding a linear-time construction on general ordered alphabets. The central technical point is that borders do not create hidden rescanning costs: the required correction term is already encoded by LCE values on nearest-greater-suffix edges and is absorbed by the same amortized machinery as in the standard case. Experimentally, the inverse construction preserves the same practical linear-time profile as the standard one.

A natural perspective is suffix sorting with the aid of the standard and inverse Lyndon arrays. It is well known that the standard Lyndon array supports suffix sorting through the compatibility property
$w_{\lambda}(i) \prec w_{\lambda}(j) \implies x_i \prec x_j$,
where $w_{\lambda}(i)$ denotes the maximal Lyndon subword starting at $i$ [9, 15]. By contrast, the inverse case does not admit a direct analogue.

Proposition 1 (No direct inverse compatibility). There is no direct inverse analogue of the standard compatibility property for maximal inverse Lyndon factors. In particular, it is not true in general that
$w_{\lambda^{-1}}(i) \prec w_{\lambda^{-1}}(j) \implies x_i \prec x_j$.

Proof. Take $x = babacbabaa$ with $i = 1$ and $j = 6$. Then $w_{\lambda^{-1}}(1) = baba$ and $w_{\lambda^{-1}}(6) = babaa$, so $w_{\lambda^{-1}}(1) \prec w_{\lambda^{-1}}(6)$, but $x_1 = babacbabaa \succ x_6 = babaa$.
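The counterexample in the proof can be replayed mechanically. The sketch below is an illustration under stated assumptions: it uses Python's built-in string order, which coincides with $a < b < c$ on this word, it adds no sentinels, and the helper name `max_inverse_lyndon` is hypothetical.

```python
def max_inverse_lyndon(x, i):
    """Maximal inverse Lyndon subword of x starting at 1-based position i."""

    def is_inverse_lyndon(w):
        # w must be greater than each of its non-empty proper suffixes.
        return all(w > w[k:] for k in range(1, len(w)))

    j = i  # x[i-1:j] is the current inverse Lyndon prefix (1-based i..j)
    # Prefix closure of inverse Lyndon words justifies the greedy extension.
    while j < len(x) and is_inverse_lyndon(x[i - 1:j + 1]):
        j += 1
    return x[i - 1:j]


x = "babacbabaa"
w1 = max_inverse_lyndon(x, 1)
w6 = max_inverse_lyndon(x, 6)
print(w1, w6)            # baba babaa
print(w1 < w6)           # True: the factors compare as w1 ≺ w6
print(x[0:] > x[5:])     # True: yet the suffixes compare as x_1 ≻ x_6
```

The two comparisons disagree because `baba` is a proper prefix of `babaa`, so the factor order is decided by length alone while the suffix order is decided by the symbol that follows the first factor.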
The obstruction is the prefix case: if an inverse Lyndon word has the form z = z′α, one may have z ≻ α while only z′ ⪰ α, so the continuation of the suffix is not determined by the maximal inverse factors alone. Nevertheless, the inverse Lyndon array still provides constant-time comparison rules for relevant suffix pairs.

Proposition 2. For positions i < j, each of the following conditions determines the order of x_i and x_j in constant time:
1. if j < i + λ[i], then x_i ≺ x_j;
2. if j = next^{-1}[i], equivalently j = i + λ^{-1}[i] − nlce[i], then x_i ≺ x_j;
3. if j < next^{-1}[i], equivalently j < i + λ^{-1}[i] − nlce[i], then x_i ≻ x_j;
4. if i = prev[j] or i = prev^{-1}[j], the comparison follows immediately from the defining edge.

Proof. Each item follows directly from the definitions.

Combining the present results with other properties of ICFL may support efficient approaches to suffix sorting, a direction that also aligns with the broader formal-language perspective discussed in [27].

Disclosure of Interests. The authors have no competing interests to declare that are relevant to the content of this article.

References

1. Lyndon, R.C.: On Burnside’s problem. II. Trans. Amer. Math. Soc. 78(2), 329–332 (1955). https://doi.org/10.1090/S0002-9947-1955-0067884-8
2. Duval, J.P.: Factorizing words over an ordered alphabet. J. Algorithms 4(4), 363–381 (1983). https://doi.org/10.1016/0196-6774(83)90017-2
3. Manber, U., Myers, G.: Suffix arrays: A new method for on-line word searches. SIAM J. Comput. 22(5), 935–948 (1993). https://doi.org/10.1137/0222058
4. Manacher, G.K.: A new linear-time “on-line” algorithm for finding the smallest initial palindrome of a word. J. ACM 22(3), 346–351 (1975)
5. Alstrup, S., Gavoille, C., Kaplan, H., Rauhe, T.: Nearest common ancestors: a survey and a new distributed algorithm. In: Proc. SPAA 2002, pp. 258–264. ACM (2002). https://doi.org/10.1145/564870.564914
6. Franek, F., Islam, A.S.M.S., Rahman, M.S., Smyth, W.F.: Algorithms to compute the Lyndon array. In: Proc. Prague Stringology Conference 2016, pp. 172–184 (2016)
7. Franek, F., Liut, M.: Algorithms to compute the Lyndon array revisited. In: Proc. Prague Stringology Conference 2019, pp. 16–28 (2019)
8. Bannai, H., I, T., Inenaga, S., Nakashima, Y., Takeda, M., Tsuruta, K.: The “runs” theorem. SIAM J. Comput. 46(5), 1501–1514 (2017). https://doi.org/10.1137/15M1011032
9. Baier, U.: Linear-time suffix sorting, a new approach for suffix array construction. In: CPM 2016, LIPIcs, vol. 54, pp. 23:1–23:12. Schloss Dagstuhl (2016). https://doi.org/10.4230/LIPIcs.CPM.2016.23
10. Franek, F., Paracha, A., Smyth, W.F.: The linear equivalence of the suffix array and the partially sorted Lyndon array. In: Proc. Prague Stringology Conference 2017, pp. 77–84 (2017)
11. Bonizzoni, P., De Felice, C., Zaccagnino, R., Zizza, R.: Inverse Lyndon words and inverse Lyndon factorizations of words. Adv. Appl. Math. 101, 281–319 (2018). https://doi.org/10.1016/j.aam.2018.08.005
12. Bonizzoni, P., De Felice, C., Riccardi, B., Zaccagnino, R., Zizza, R.: Unveiling the connection between the Lyndon factorization and the canonical inverse Lyndon factorization via a border property. In: MFCS 2024, LIPIcs, vol. 306, pp. 31:1–31:14. Schloss Dagstuhl (2024). https://doi.org/10.4230/LIPIcs.MFCS.2024.31
13. Ellert, J.: Lyndon arrays simplified. In: ESA 2022, LIPIcs, vol. 244, pp. 48:1–48:14. Schloss Dagstuhl (2022). https://doi.org/10.4230/LIPIcs.ESA.2022.48
14. Bille, P., Ellert, J., Fischer, J., Gørtz, I.L., Kurpicz, F., Munro, I., Rotenberg, E.: Space efficient construction of Lyndon arrays in linear time. CoRR abs/1911.03542 (2019)
15. Bertram, N., Ellert, J., Fischer, J.: Lyndon words accelerate suffix sorting. In: ESA 2021, LIPIcs, vol. 204, pp. 15:1–15:13. Schloss Dagstuhl (2021). https://doi.org/10.4230/LIPIcs.ESA.2021.15
16. Bannai, H., Ellert, J.: Lyndon arrays in sublinear time. In: ESA 2023, LIPIcs, vol. 274, pp. 14:1–14:16. Schloss Dagstuhl (2023). https://doi.org/10.4230/LIPIcs.ESA.2023.14
17. Badkobeh, G., Crochemore, M., Ellert, J., Nicaud, C.: Back-to-front online Lyndon forest construction. In: CPM 2022, LIPIcs, vol. 223, pp. 13:1–13:23. Schloss Dagstuhl (2022). https://doi.org/10.4230/LIPIcs.CPM.2022.13
18. Bonizzoni, P., et al.: Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches. Inform. Sci. 607, 458–476 (2022). https://doi.org/10.1016/j.ins.2022.06.005
19. Bauer, M.J., Cox, A.J., Rosone, G.: Lightweight algorithms for constructing and inverting the BWT of word collections. Theoret. Comput. Sci. 483, 134–148 (2013). https://doi.org/10.1016/j.tcs.2012.02.002
20. Fischer, J., Kurpicz, F.: Dismantling DivSufSort. CoRR abs/1710.01896 (2017)
21. Ferragina, P., Navarro, G.: The Pizza&Chili Corpus. Dataset and benchmark collection for compressed indexes (2007). https://pizzachili.dcc.uchile.cl/
22. Arnold, R., Bell, T.: A corpus for the evaluation of lossless compression algorithms. In: Proceedings of the Data Compression Conference, pp. 201–210. IEEE Computer Society (1997)
23. The Canterbury Corpus: Descriptions of the corpora. University of Canterbury. https://corpus.canterbury.ac.nz/descriptions/
24. Crochemore, M., Iliopoulos, C.S., Kociumaka, T., Kundu, R., Pissis, S.P., Radoszewski, J., Rytter, W., Waleń, T.: Near-Optimal Computation of Runs over General Alphabet via Non-Crossing LCE Queries. In: String Processing and Information Retrieval (SPIRE 2016), LNCS, vol. 9954, pp. 22–34. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46049-9_3
25. Ellert, J., Fischer, J.: Linear Time Runs Over General Ordered Alphabets. In: 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021), LIPIcs, vol. 198, pp. 63:1–63:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021). https://doi.org/10.4230/LIPIcs.ICALP.2021.63
26. Bonizzoni, P., De Felice, C., Zaccagnino, R., Zizza, R.: On the longest common prefix of suffixes in an inverse Lyndon factorization and other properties. Theor. Comput. Sci. 862, 24–41 (2021). https://doi.org/10.1016/j.tcs.2020.10.034
27. Bonizzoni, P., De Felice, C., Pirola, Y., Rizzi, R., Zaccagnino, R., Zizza, R.: Can Formal Languages Help Pangenomics to Represent and Analyze Multiple Genomes? In: Diekert, V., Volkov, M.V. (eds.) Developments in Language Theory, DLT 2022. LNCS, vol. 13257, pp. 3–12. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-05578-2_1
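
Illustration of Proposition 2. The following Python sketch is not the authors' implementation; it only illustrates how rules 1 to 3 of Proposition 2 can act as constant-time table lookups once the arrays are available. It rests on several assumptions that should be checked against the definitions given earlier in the paper: positions are 0-indexed, λ[i] is taken to equal nss[i] − i where nss[i] is the next smaller suffix position, next^{-1}[i] is taken to be the next greater suffix position, and n = |x| is used as the "no such position" sentinel. Both arrays are built naively in quadratic time rather than with the linear-time construction, and the function names are hypothetical. Rule 4 is omitted because it depends on the prev and prev^{-1} edges, which are not restated here.

```python
def next_smaller_suffix(x):
    """nss[i] = smallest j > i with x[j:] < x[i:], or len(x) if none.

    Naive quadratic construction, for illustration only.
    Assumption: the Lyndon array satisfies lambda[i] = nss[i] - i.
    """
    n = len(x)
    return [next((j for j in range(i + 1, n) if x[j:] < x[i:]), n)
            for i in range(n)]


def next_greater_suffix(x):
    """ngs[i] = smallest j > i with x[j:] > x[i:], or len(x) if none.

    Assumption: this plays the role of next^{-1}[i] in Proposition 2.
    """
    n = len(x)
    return [next((j for j in range(i + 1, n) if x[j:] > x[i:]), n)
            for i in range(n)]


def compare_suffixes(i, j, nss, ngs):
    """Apply rules 1-3 of Proposition 2 to positions i < j.

    Returns '<' if x_i precedes x_j, '>' if x_i follows x_j,
    and None when none of the three rules applies.
    """
    if not i < j < len(nss):
        raise ValueError("expected 0 <= i < j < n")
    if j < nss[i]:    # rule 1: j < i + lambda[i]
        return "<"
    if j == ngs[i]:   # rule 2: j = next^{-1}[i]
        return "<"
    if j < ngs[i]:    # rule 3: j < next^{-1}[i]
        return ">"
    return None       # not covered by rules 1-3
```

For x = "banana" this sketch gives nss = [1, 3, 3, 5, 5, 6] and ngs = [2, 2, 6, 4, 6, 6]. The pair (1, 2) is decided by rule 1, the pair (2, 4) by rule 3, and the pair (0, 3) is not covered by rules 1 to 3, which is consistent with the proposition addressing only the relevant suffix pairs.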
