The Inverse Lyndon Array: Definition, Properties, and Linear-Time Construction

Pietro Negri¹, Manuel Sica¹, Rocco Zaccagnino¹, and Rosalba Zizza¹

¹ Department of Computer Science, University of Salerno, Fisciano (SA), Italy
{masica, rzaccagnino, rzizza}@unisa.it, p.negri137@gmail.com

Abstract. The Lyndon array stores, at each position of a word, the length of the longest Lyndon subword starting at that position, and plays an important role in combinatorics on words, for example in the construction of fundamental data structures such as the suffix array. In this paper, we introduce the Inverse Lyndon Array, the analogous structure for inverse Lyndon words, namely words that are lexicographically greater than all their proper suffixes. Unlike standard Lyndon words, inverse Lyndon words may have non-trivial borders, which introduces a genuine theoretical difficulty. We show that the inverse Lyndon array can be characterized in terms of the next greater suffix array together with a border-correction term, and prove that this correction coincides with a longest common extension (LCE) value. Building on this characterization, we adapt the nearest-suffix framework underlying Ellert's linear-time construction of the Lyndon array to the inverse setting, obtaining an O(n)-time algorithm for general ordered alphabets. Finally, we discuss implications for suffix comparison and report experiments on random, structured, and real datasets showing that the inverse construction exhibits the same practical linear-time behavior as the standard one.

Keywords: Inverse Lyndon words · Lyndon array · Combinatorics on words · Word algorithms · Suffix array

1 Introduction

Lyndon words, introduced by Lyndon [1] and extensively studied in combinatorics on words, are primitive words that are lexicographically minimal among their rotations.
Equivalently, a non-empty word w is a Lyndon word if w ≺ s for every non-empty proper suffix s of w, where ≺ denotes the lexicographic order induced by a total order on the alphabet Σ. The Lyndon array λ[1..n] of a word x[1..n] gives, at each position i, the length of the longest Lyndon factor of x starting at i. Since Bannai et al. [8] showed that λ is sufficient to compute all maximal repetitions in linear time, efficient construction algorithms for λ have received considerable attention. Franek et al. [6, 7] related λ to a next-smaller-value computation on the inverse suffix array, while Ellert [13] later gave the first direct linear-time algorithm on general ordered alphabets, based on nearest smaller suffix edges and an LCE acceleration mechanism inspired by Manacher's algorithm [4]. Subsequent work considered space-efficient variants, sub-linear algorithms, and links with the Lyndon forest [14, 16, 17].

A closely related notion is that of inverse Lyndon words: a non-empty word w is inverse Lyndon if s ≺ w for every non-empty proper suffix s [11]. Inverse Lyndon words were introduced in connection with the Inverse Canonical Lyndon Factorization (ICFL), a variant of CFL that may yield more balanced factors and is therefore well suited to bioinformatics applications [18]. A border property for the canonical inverse Lyndon factorization was introduced in [26] and later clarified in [12].

From a combinatorial viewpoint, the inverse setting is not a merely formal dualization of the standard one. The crucial asymmetry is that standard Lyndon words are unbordered, whereas inverse Lyndon words may have non-trivial borders. As a consequence, the classical identity λ[i] = next[i] − i no longer survives unchanged.
The main difficulty is therefore not merely to restate Ellert's framework with the inequalities reversed, but to determine precisely how borders interact with nearest greater suffixes and to show that this interaction preserves the amortized linearity of the underlying LCE machinery. The border correction is structural: without identifying it as an LCE value on a nearest-greater-suffix edge, one would need a separate border computation, breaking the direct linear-time reduction.

Our result also connects two nearby lines of work, namely non-crossing LCE queries on general ordered alphabets [24, 25] and the border-based view of inverse Lyndon words and ICFL [12]. In the inverse Lyndon array setting, we show that the needed border correction is recovered directly from the LCE on the corresponding next greater suffix (NGS) edge.

Contributions. We introduce the Inverse Lyndon Array λ⁻¹[1..n], which records at each position i the length of the longest inverse Lyndon subword starting at i.
The main contributions of this work are as follows:

– an exact characterization of λ⁻¹ through the next greater suffix array, with a border correction term (see Lemma 7);
– the identification of the border correction with an LCE value on a nearest-greater suffix edge, which removes the need for any explicit border computation and distinguishes our approach from border-explicit treatments in the inverse Lyndon literature (see Lemma 8);
– the adaptation of the non-crossing property, chain iteration, and LCE acceleration from the classical NSS (Next Smaller Suffix)/PSS (Previous Smaller Suffix) setting to the novel NGS (Next Greater Suffix)/PGS (Previous Greater Suffix) setting, yielding a linear-time algorithm for constructing the inverse Lyndon array over general ordered alphabets;
– a study of suffix-sorting implications, showing that the compatibility property underlying the standard Lyndon array has no direct inverse analogue, while the joint use of λ and λ⁻¹ still yields constant-time rules for several suffix comparisons.

The implementation and benchmark scripts are available online at https://github.com/FLaTNNBio/inverselyndonarray.

2 Preliminaries

Words and lexicographic orders. A word x = x[1] x[2] ··· x[n] is a sequence of symbols from a totally ordered alphabet (Σ, <). We write |x| = n for its length, ε for the empty word, and Σ* and Σ⁺ for the sets of all words and all non-empty words, respectively. For 1 ≤ i ≤ j ≤ n, the subword x[i..j] is the sequence x[i] x[i+1] ··· x[j]; if i > j, it equals ε. A subword x[1..j] is a prefix, and x_i = x[i..n] denotes the suffix starting at position i. The lexicographic order ≺ on Σ* is defined as follows: x ≺ y if and only if either y = xw for some non-empty w, or x = urv and y = usw with r < s, for some words u, v, w and symbols r, s ∈ Σ. The Longest Common Extension of two suffixes is

    lce(i, j) = max{ |u| : x_i = uv, x_j = uw }.
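The lexicographic order and the LCE defined above can be made concrete with a small sketch. It is illustrative only: the helper names `encode` and `lce`, the 0-based indexing, and the choice of passing the alphabet order as a string are assumptions of this example, not part of the paper's implementation.

```python
def encode(word, order):
    """Map each symbol to its rank in `order` (smallest symbol first)."""
    rank = {c: r for r, c in enumerate(order)}
    return [rank[c] for c in word]


def lce(x, i, j):
    """Naive longest common extension of the suffixes x[i:] and x[j:] (0-based)."""
    k = 0
    while i + k < len(x) and j + k < len(x) and x[i + k] == x[j + k]:
        k += 1
    return k


x = encode("banana", "abn")   # alphabet order a < b < n
# Python compares lists lexicographically and treats a proper prefix as
# smaller, which coincides with the order defined above.
assert x[3:] < x[1:]          # "ana" is a proper prefix of "anana"
assert x[1:] < x[0:]          # "anana" precedes "banana" because a < b
assert lce(x, 1, 3) == 3      # "anana" and "ana" share the prefix "ana"
```

Encoding symbols as ranks keeps the alphabet order explicit, which matters later when the sentinels are placed above rather than below the alphabet.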
We frame every word with sentinels, i.e., x = # x(1..n) $ (so that x[1] = # and x[n] = $), where # and $ satisfy # < $ < a for all a ∈ Σ in the standard setting, and # > $ > a for all a ∈ Σ in the inverse setting. In the standard case, placing # strictly below every alphabet symbol ensures that x_1 is the lexicographic minimum among all suffixes of the framed word, so all suffix ranks are distinct and every NSS/PSS edge is unambiguous. In the inverse case, the reversed order makes x_1 the lexicographic maximum, and again guarantees a strict total order on all suffixes with no ties.

A non-empty word w is a Lyndon word if w ≺ s for every non-empty proper suffix s of w [2]. A subword x[i..j] is a maximal Lyndon subword of x starting at i if it is a Lyndon word and either j = n or x[i..j+1] is not Lyndon.

Definition 1 (Lyndon Array). The Lyndon array λ[1..n] of a word x[1..n] is defined by

    λ[i] = max{ m ∈ [1, n − i + 1] : x[i..i+m−1] is a Lyndon word }.

The suffix array SA[1..n] is the permutation such that x_{SA[1]} ≺ x_{SA[2]} ≺ ··· ≺ x_{SA[n]}. The inverse suffix array ISA satisfies ISA[SA[i]] = i, so ISA[j] is the lexicographic rank of x_j. Franek et al. [6] proved the following characterization.

Theorem 1 ([6]). The subword x[i..j] is a maximal Lyndon subword if and only if:
1. ISA[i] < ISA[h] for all h ∈ [i+1, j], and
2. either j = n or ISA[j+1] < ISA[i].

Consequently, λ can be obtained from a next-smaller-value (NSV) computation on the inverse suffix array. Given an array A[1..n], the NSV array is defined by

    NSV[i] = min{ j > i : A[j] < A[i] } − i,

with NSV[i] = n − i + 1 if no such j exists; taking A = ISA gives λ[i] = NSV[i].

Fig. 1. NSS and PSS edges for x = #banana$ (positions 1 to 8). Dashed arcs denote next smaller suffix edges, while solid arcs denote previous smaller suffix edges.
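The reduction of λ to an NSV computation on ISA (Theorem 1) can be sketched in a few lines. This is a hedged illustration rather than the construction of [6]: the suffix array is obtained by naive sorting purely for brevity (so the sketch is not linear time), indices are 0-based, the word is not framed by sentinels, and the function names are hypothetical.

```python
def nsv_lengths(a):
    """NSV[i] = distance to the next strictly smaller value, or n - i if none."""
    n = len(a)
    out = [0] * n
    stack = []  # indices whose next smaller value has not been found yet
    for j in range(n):
        while stack and a[j] < a[stack[-1]]:
            i = stack.pop()
            out[i] = j - i
        stack.append(j)
    for i in stack:          # no smaller value to the right
        out[i] = n - i
    return out


def lyndon_array_via_isa(word):
    """Lyndon array of `word` under Python's default character order."""
    n = len(word)
    # Naive suffix sorting, a stand-in for any suffix array construction.
    sa = sorted(range(n), key=lambda i: word[i:])
    isa = [0] * n
    for rank, i in enumerate(sa):
        isa[i] = rank
    return nsv_lengths(isa)


print(lyndon_array_via_isa("banana"))  # -> [1, 2, 1, 2, 1, 1]
```

The printed values agree with the entries λ[2..7] of Example 1, where the same word appears framed by sentinels.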
2.1 The NSS/PSS Framework

We briefly recall the framework of Ellert [13] for computing λ in O(n) time, since it is the template for our inverse construction. Throughout this section we assume x = # x(1..n) $ with # < $ < a for all a ∈ Σ.

Definition 2 (NSS and PSS Arrays). For a word x[1..n], define

    next[i] = min{ j ∈ (i, n] : x_j ≺ x_i },    prev[i] = max{ j ∈ [1, i) : x_j ≺ x_i },

with next[1] = next[n] = n + 1, prev[1] = 0, and prev[n] = 1.

Example 1. For x = #banana$, with # < $ < a < b < n, the arrays are:

    i         1  2  3  4  5  6  7  8
    x         #  b  a  n  a  n  a  $
    next[i]   9  3  5  5  7  7  8  9
    prev[i]   0  1  1  3  1  5  1  1
    λ[i]      8  1  2  1  2  1  1  1

Figure 1 shows the corresponding NSS and PSS edges.

Lemma 1 ([13]). For all i ∈ [1, n], next[i] = i + λ[i].

Lemma 2 ([13]). For all i ∈ [2, n], the word x[prev[i]..i−1] is a Lyndon word.

The key properties enabling linear-time construction are the following.

Lemma 3 (Non-Crossing [13]). Let l₁ < r₁ and l₂ < r₂ be index pairs, each connected by an NSS or PSS edge. Then it is impossible to have l₁ < l₂ < r₁ < r₂.

Lemma 4 (Chain Iteration [13]). For any r ∈ [2, n]:
1. prev[r] = prev*[r − 1],
2. next[l] = r if and only if l = prev*[r − 1] and l > prev[r],
where prev*[r − 1] denotes an element on the PSS chain from r − 1 (that is, one of r − 1, prev[r − 1], prev[prev[r − 1]], ...).

Lemma 5 (LCE Acceleration [13]). Let l = prev[k] and r = next[k]. Then:
1. if lce(l, k) = lce(k, r), then lce(l, r) ≥ lce(k, r);
2. if lce(l, k) < lce(k, r), then lce(l, r) = lce(l, k) and prev[r] = l;
3. if lce(l, k) > lce(k, r), then lce(l, r) = lce(k, r) and next[l] = r.
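Definition 2 and Lemma 1 can be checked on Example 1 with a short sketch. It is a brute-force illustration under stated assumptions, not Ellert's algorithm: suffix ranks come from naive sorting, the sentinel order # < $ < letters is taken from ASCII, and the function name `nss_pss` is hypothetical.

```python
def nss_pss(word):
    """Return (next, prev) as 1-based lists (index 0 unused) for a framed word.

    Assumes word[0] is the unique smallest symbol and word[-1] is unique,
    e.g. "#banana$" under ASCII order, where '#' < '$' < letters.
    """
    n = len(word)
    # Naive suffix ranks (quadratic); a stand-in for any suffix sorter.
    order = sorted(range(n), key=lambda i: word[i:])
    rank = [0] * n
    for r, i in enumerate(order):
        rank[i] = r
    nxt = [n + 1] * (n + 1)   # default n + 1 means "no smaller suffix to the right"
    prv = [0] * (n + 1)
    stack = []                # 1-based positions with increasing suffix rank
    for j in range(1, n + 1):
        while stack and rank[stack[-1] - 1] > rank[j - 1]:
            nxt[stack.pop()] = j          # j starts the next smaller suffix
        prv[j] = stack[-1] if stack else 0
        stack.append(j)
    return nxt, prv


nxt, prv = nss_pss("#banana$")
print(nxt[1:])                                # -> [9, 3, 5, 5, 7, 7, 8, 9]
print(prv[1:])                                # -> [0, 1, 1, 3, 1, 5, 1, 1]
print([nxt[i] - i for i in range(1, 9)])      # Lemma 1 -> [8, 1, 2, 1, 2, 1, 1, 1]
```

The stack holds exactly the PSS chain of the current position, which is the structure that Lemma 4 exploits.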
Together with the Smart-Lce procedure of [13], which computes all required LCE values in O(n) total time via a global rightmost-inspected-position variable, these lemmas yield an algorithm that computes λ in O(n) time and O(n) space on general ordered alphabets.

3 The Inverse Lyndon Array

We now develop the theory of the inverse Lyndon array. Throughout this section, x = # x(1..n) $, with sentinels satisfying # > $ > a for all a ∈ Σ.

3.1 Inverse Lyndon Words

Definition 3 ([11]). A word w ∈ Σ⁺ is an inverse Lyndon word if s ≺ w for every non-empty proper suffix s of w.

Unlike standard Lyndon words, inverse Lyndon words may have non-trivial borders. For example, dabda is an inverse Lyndon word with maximal border da. The following prefix-closure property is fundamental [11].

Lemma 6 ([11]). Every non-empty prefix of an inverse Lyndon word is an inverse Lyndon word.

Definition 4 (Maximal Inverse Lyndon Subword). A subword x[i..j] is a maximal inverse Lyndon subword starting at i if it is an inverse Lyndon word and either j = n or x[i..j+1] is not an inverse Lyndon word.

Definition 5 (Inverse Lyndon Array). The Inverse Lyndon Array is defined by

    λ⁻¹[i] = max{ m ∈ [1, n − i + 1] : x[i..i+m−1] is an inverse Lyndon word }.

Example 2. Let x = #aababbaa$, with # > $ > b > a. The inverse Lyndon array and the associated NGS/PGS arrays (Definition 6), together with nlce[i] = lce(i, next⁻¹[i]), are:

    i          1   2   3   4   5   6   7   8   9   10
    x          #   a   a   b   a   b   b   a   a   $
    λ⁻¹[i]     10  2   1   3   1   4   3   2   1   1
    next⁻¹[i]  11  3   4   6   6   10  10  9   10  11
    prev⁻¹[i]  0   1   1   1   4   1   6   7   7   1
    nlce[i]    0   1   0   1   0   0   0   1   0   0

At i = 4, the maximal inverse Lyndon subword is x[4..6] = bab, so λ⁻¹[4] = 3. Its longest border has length 1, hence next⁻¹[4] = 6 and nlce[4] = 1, in agreement with Corollary 1. At i = 6, the maximal inverse Lyndon subword is x[6..9] = bbaa, so λ⁻¹[6] = 4. Since this factor is unbordered, next⁻¹[6] = 10 and nlce[6] = 0.
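The λ⁻¹ row of Example 2 can be reproduced directly from Definitions 3 and 5 by brute force. The sketch below is only a reference check, with cubic worst-case cost; the helper names and the convention of passing the alphabet order as a string (smallest symbol first, sentinels last) are assumptions of this example.

```python
def encode(word, order):
    """Map each symbol to its rank in `order` (smallest symbol first)."""
    rank = {c: r for r, c in enumerate(order)}
    return [rank[c] for c in word]


def is_inverse_lyndon(w):
    """True if w is greater than each non-empty proper suffix (Definition 3)."""
    return all(w[k:] < w for k in range(1, len(w)))


def inverse_lyndon_array_bruteforce(x):
    """Inverse Lyndon array from Definition 5, as a 0-based list of lengths."""
    n = len(x)
    lam = []
    for i in range(n):
        m = 1
        # Prefix closure (Lemma 6) lets us stop at the first failing extension.
        while i + m < n and is_inverse_lyndon(x[i:i + m + 1]):
            m += 1
        lam.append(m)
    return lam


x = encode("#aababbaa$", "ab$#")   # inverse sentinel mode: a < b < $ < #
print(inverse_lyndon_array_bruteforce(x))   # -> [10, 2, 1, 3, 1, 4, 3, 2, 1, 1]
```

The same helper confirms the bordered example from the text: `is_inverse_lyndon(encode("dabda", "abd"))` evaluates to `True`.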
3.2 NGS/PGS Arrays and their Relationship to λ⁻¹

Definition 6 (NGS and PGS Arrays). For a word x[1..n] with inverted sentinels, define

    next⁻¹[i] = min{ j ∈ (i, n] : x_j ≻ x_i },    prev⁻¹[i] = max{ j ∈ [1, i) : x_j ≻ x_i },

with next⁻¹[1] = next⁻¹[n] = n + 1, prev⁻¹[1] = 0, and prev⁻¹[n] = 1.

Thus, in the inverse setting, the relevant edges connect nearest greater suffixes rather than nearest smaller suffixes.

Lemma 7 (NGS and λ⁻¹ equivalence). Let z be the maximal inverse Lyndon subword starting at i, and let border(z) denote the length of its longest proper border. Then

    next⁻¹[i] = i + λ⁻¹[i] − border(z).

Proof. Write z = bhb, where b ∈ Σ* is the border, possibly empty, and h ∈ Σ⁺. Then x_i = bhbα for some α ∈ Σ⁺. Let j = next⁻¹[i]. We claim that x_j = bα, that is, the suffix x_j starts at the second occurrence of the border.

Suppose first that x_j starts before that occurrence, say x_j = βbα with β ∈ Σ⁺ and z = bγβb. Then βb is a proper suffix of z, so by Definition 3 we have z ≻ βb. Hence zα ≻ βbα, that is, x_i ≻ x_j, contradicting the choice of j.

Suppose instead that x_j starts later, inside the second occurrence of the border, say x_j = βα with b = γβ and γ, β ∈ Σ⁺. Consider the intermediate suffix x_{k_m} = βhγβα. Since j = next⁻¹[i], we have x_{k_m} ≺ x_i ≺ x_j. However, from x_j = βα ≻ βhγβα = x_{k_m}, we obtain bα = γβα ≻ γβhγβα = x_i, which implies x_{k_m} ≻ x_i, again a contradiction.

Therefore x_j = bα, and so j = i + |z| − |b| = i + λ⁻¹[i] − border(z).

The border term is the main structural difference with the standard case. Standard Lyndon words are unbordered, so the corresponding correction term is always zero.

Lemma 8 (Border via LCE). Let z be the maximal inverse Lyndon subword starting at i, and let j = next⁻¹[i]. Then lce(i, j) = border(z).

Proof. Using the notation above, Lemma 7 yields x_j = bα.
Since x_i = bhbα and x_j = bα, the suffixes x_i and x_j agree on the prefix b, so lce(i, j) ≥ |b|. For the upper bound, note that x_j ≻ x_i since j = next⁻¹[i]. The two suffixes share a common prefix of length |b|; at the next position, x_i[|b| + 1] = h[1] while x_j[|b| + 1] = α[1]. Since x_j ≻ x_i and they agree on the first |b| positions, the comparison is decided at position |b| + 1, which forces α[1] > h[1]. In particular α[1] ≠ h[1], so the common prefix ends exactly at length |b| and lce(i, j) = |b| = border(z).

Fig. 2. Border correction for the inverse Lyndon array. If the maximal inverse Lyndon word starting at position i has the form z = bhb, then the next greater suffix starts at the second occurrence of b. Consequently, lce(i, j) = |b| and λ⁻¹[i] = (j − i) + |b|, where j = next⁻¹[i].

Corollary 1. For every i, λ⁻¹[i] = next⁻¹[i] − i + lce(i, next⁻¹[i]).

Remark 1 (Why the border term is structural). The correction term in Corollary 1 is not a post-processing detail. Without Lemma 8, the inverse array would require explicit border information for each maximal inverse Lyndon factor. That would introduce an additional computational layer not present in the standard case. The point of Lemma 8 is precisely that the border is already encoded by the same nearest-suffix LCE information computed by the algorithm. Thus, computing next⁻¹ together with the associated array nlce[i] = lce(i, next⁻¹[i]) suffices to recover λ⁻¹.

Lemma 9. For all i ∈ [2, n], x[prev⁻¹[i]..i−1] is an inverse Lyndon word.

Proof. Let p = prev⁻¹[i] and w = x[p..i−1]. By definition of PGS, x_p ≻ x_i and x_k ≺ x_i for every k ∈ (p, i).
By transitivity, x_p ≻ x_k for every k ∈ (p, i). Let s = x[k..i−1] be any proper non-empty suffix of w, with k ∈ (p, i). Since x_p ≻ x_k, let m ≥ 1 be the first index where x_p and x_k differ; then x[p + m − 1] > x[k + m − 1]. If m ≤ i − k = |s|, the mismatch falls within both w and s, so w ≻ s. If m > i − k, then s = x[k..i−1] is a proper prefix of w (they agree on all |s| positions of s), so w ≻ s by the prefix rule of lexicographic order. In both cases w ≻ s, so w is an inverse Lyndon word.

3.3 Combinatorial Properties of NGS/PGS Edges

We now show that the main structural lemmas from the NSS/PSS setting remain valid in the inverse case. These lemmas also fit into the broader context of non-crossing LCE-based techniques on general ordered alphabets. Here, however, the non-crossing structure is induced specifically by NGS/PGS edges in the inverse Lyndon setting.

Lemma 10 (Non-Crossing of NGS/PGS Edges). Let l₁ < r₁ and l₂ < r₂ be index pairs, each connected by an NGS or PGS edge. Then it is impossible to have l₁ < l₂ < r₁ < r₂.

Proof. Assume for contradiction that l₁ < l₂ < r₁ < r₂. Since l₂ ∈ (l₁, r₁) and l₁, r₁ are connected by an NGS or PGS edge, every suffix starting in (l₁, r₁) is lexicographically smaller than both endpoints, hence x_{l₂} ≺ x_{r₁}. On the other hand, since r₁ ∈ (l₂, r₂) and l₂, r₂ are connected, we get x_{l₂} ≻ x_{r₁}. This contradiction proves the claim.

Lemma 11 (Chain Iteration for NGS/PGS). For any r ∈ [2, n]:
1. prev⁻¹[r] = prev*⁻¹[r − 1];
2. next⁻¹[l] = r if and only if l = prev*⁻¹[r − 1] and l > prev⁻¹[r],
where prev*⁻¹[r − 1] denotes an element on the PGS chain from r − 1.

Proof. For the first statement, suppose prev⁻¹[r] ≠ prev*⁻¹[r − 1], that is, prev⁻¹[r] does not lie on the PGS chain from r − 1. Then there exists a chain element r′ = prev*⁻¹[r − 1] such that prev⁻¹[r] ∈ (prev⁻¹[r′], r′), giving

    prev⁻¹[r′] < prev⁻¹[r] < r′ < r,

which contradicts Lemma 10.
For the second statement, assume first that next⁻¹[l] = r. Then all suffixes starting in (l, r) are smaller than x_r, so prev⁻¹[r] ∉ [l, r) and therefore prev⁻¹[r] < l. If l ≠ prev*⁻¹[r − 1], then there exists r′ = prev*⁻¹[r − 1] such that

    prev⁻¹[r′] < l < r′ < r,

again contradicting Lemma 10. Conversely, assume that prev⁻¹[r] < l and l = prev*⁻¹[r − 1]. Since l ∈ (prev⁻¹[r], r), we have x_l ≺ x_r. Moreover, the chain property implies x_k ≺ x_l for every k ∈ (l, r). Therefore r is the first position to the right of l with a greater suffix, namely next⁻¹[l] = r.

Lemma 12 (LCE Acceleration for NGS/PGS). Let l = prev⁻¹[k] and r = next⁻¹[k]. Then:
1. if lce(l, k) = lce(k, r), then lce(l, r) ≥ lce(k, r), and either prev⁻¹[r] = l or next⁻¹[l] = r;
2. if lce(l, k) < lce(k, r), then lce(l, r) = lce(l, k) and prev⁻¹[r] = l;
3. if lce(l, k) > lce(k, r), then lce(l, r) = lce(k, r) and next⁻¹[l] = r.

Proof. For the first case, since l = prev⁻¹[k] and r = next⁻¹[k], we have x_l ≻ x_k ≻ x_m and x_r ≻ x_k ≻ x_m for all m ∈ (l, r) \ {k}. Since lce(l, k) = lce(k, r), it follows that lce(l, r) ≥ lce(k, r). If x_l ≻ x_r, then x_l ≻ x_r ≻ x_m for all such m, so prev⁻¹[r] = l. By symmetry, if x_r ≻ x_l, then next⁻¹[l] = r.

For the second case, let c = lce(l, k). Since x_l ≻ x_k, we have x[l + c] > x[k + c]. Because lce(l, k) < lce(k, r), the suffix x_r shares a prefix of length at least c + 1 with x_k, hence x[r + c] = x[k + c]. Thus x_l and x_r agree for c positions and differ at the next one with x[l + c] > x[r + c], so x_l ≻ x_r and lce(l, r) = c. Since x_r ≻ x_m for all m ∈ (l, r), it follows that prev⁻¹[r] = l.

The third case is symmetric, yielding x_r ≻ x_l, lce(l, r) = lce(k, r), and next⁻¹[l] = r.
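The three cases of Lemma 12 can be observed on the word of Example 2 with a brute-force sketch. Everything here is an illustrative assumption: the NGS/PGS arrays are computed by direct suffix comparison in quadratic time, the helper names are hypothetical, and the check only covers positions k whose two edges stay inside the word.

```python
def encode(word, order):
    """Map each symbol to its rank in `order` (smallest symbol first)."""
    rank = {c: r for r, c in enumerate(order)}
    return [rank[c] for c in word]


def lce(x, i, j):
    """Naive LCE of the suffixes starting at 1-based positions i and j."""
    k = 0
    while i + k <= len(x) and j + k <= len(x) and x[i + k - 1] == x[j + k - 1]:
        k += 1
    return k


def ngs_pgs(x):
    """Brute-force next^-1 and prev^-1 as 1-based lists (index 0 unused)."""
    n = len(x)
    nxt = [n + 1] * (n + 1)
    prv = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            if x[j - 1:] > x[i - 1:]:
                nxt[i] = j
                break
        for j in range(i - 1, 0, -1):
            if x[j - 1:] > x[i - 1:]:
                prv[i] = j
                break
    return nxt, prv


def lemma12_cases(x):
    """Assert Lemma 12 at each interior k and count how often each case occurs."""
    n = len(x)
    nxt, prv = ngs_pgs(x)
    count = {1: 0, 2: 0, 3: 0}
    for k in range(1, n + 1):
        left, right = prv[k], nxt[k]
        if left < 1 or right > n:
            continue                      # one of the two edges leaves the word
        a, b, c = lce(x, left, k), lce(x, k, right), lce(x, left, right)
        if a == b:
            assert c >= b and (prv[right] == left or nxt[left] == right)
            count[1] += 1
        elif a < b:
            assert c == a and prv[right] == left
            count[2] += 1
        else:
            assert c == b and nxt[left] == right
            count[3] += 1
    return count


x = encode("#aababbaa$", "ab$#")
print(lemma12_cases(x))   # -> {1: 4, 2: 3, 3: 1}
```

On this word, case 3 occurs only at k = 7, where lce(6, 7) = 1 exceeds lce(7, 10) = 0 and indeed next⁻¹[6] = 10.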
3.4 The LCE-NGS Algorithm

Lemmas 11 and 12 yield Algorithm 1, which mirrors the structure of the standard LCE-NSS algorithm [13]. The key difference is the direction of the comparison in the while-loop condition: in the inverse setting we test whether x[ℓ + m] < x[r + m], because we are looking for the first suffix to the right that is lexicographically greater.

Algorithm 1: LCE-NGS for the inverse Lyndon array

    Input: word x[1..n] in inverse sentinel mode
    Output: next⁻¹, prev⁻¹, nlce, plce, λ⁻¹

    prev⁻¹[1] ← 0; next⁻¹[1] ← n + 1; next⁻¹[n] ← n + 1
    plce[1] ← 0; nlce[1] ← 0; nlce[n] ← 0
    for r ← 2 to n do
        ℓ ← r − 1; m ← Smart-Lce(ℓ, r, 0)
        while x[ℓ + m] < x[r + m] do
            next⁻¹[ℓ] ← r; nlce[ℓ] ← m
            if m = plce[ℓ] then
                m ← Smart-Lce(prev⁻¹[ℓ], r, m)
            else if m > plce[ℓ] then
                m ← plce[ℓ]
            end
            ℓ ← prev⁻¹[ℓ]
        end
        prev⁻¹[r] ← ℓ; plce[r] ← m
    end
    for i ← 1 to n do
        λ⁻¹[i] ← next⁻¹[i] − i + nlce[i]
    end

The initial values next⁻¹[1] = next⁻¹[n] = n + 1 follow Definition 6; positions 1 and n are never assigned inside the main loop.

The procedure Smart-Lce is the same amortized primitive used in the standard setting [13], but here it is queried only on NGS/PGS candidates. For self-containment, we recall the two invariants that are used in the proof of Theorem 2. First, after each explicit scan, rhs is the rightmost text position inspected so far. Second, whenever an edge value has already been fixed, the corresponding stored quantity in nlce or plce is exactly the required LCE for that edge. Hence, if a query (ℓ, r, m) satisfies r + m < rhs, then the answer lies entirely inside a previously certified window and can be returned in O(1) time from the stored edge information selected by Lemma 12. Otherwise, Smart-Lce extends explicitly, increases rhs, and stores the resulting LCE value. This is the only place where new character comparisons are performed.

Theorem 2. Algorithm 1 computes the inverse Lyndon array λ⁻¹[1..n] in O(n) time and O(n) space on general ordered alphabets.
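The control flow of Algorithm 1 can be exercised with the following sketch. It is not the authors' C++ implementation, and it does not achieve the bound of Theorem 2: the helper `extend` is a naive stand-in for Smart-Lce that simply keeps scanning from offset m, so only the loop structure and the recovery formula are faithful. The function name, the tuple it returns, and the default initialization of the arrays are assumptions of this example.

```python
def inverse_lyndon_array(word, order):
    """Sketch of Algorithm 1 (LCE-NGS) with a naive LCE extension.

    `word` must be framed as '#' + text + '$'; `order` lists the symbols from
    smallest to largest, with '$' and '#' last (inverse sentinel mode).
    Returns (lam, nxt, prv, nlce) as 1-based lists; index 0 is padding.
    """
    rank = {c: r for r, c in enumerate(order)}
    n = len(word)
    x = [None] + [rank[c] for c in word]      # x[1..n]

    def extend(i, j, m):
        # Stand-in for Smart-Lce: the first m symbols are known to match,
        # so continue from offset m. This is NOT the amortized procedure of [13].
        while i + m <= n and j + m <= n and x[i + m] == x[j + m]:
            m += 1
        return m

    nxt = [n + 1] * (n + 1)   # default covers next^-1[1] and next^-1[n]
    prv = [0] * (n + 1)
    nlce = [0] * (n + 1)
    plce = [0] * (n + 1)
    for r in range(2, n + 1):
        l = r - 1
        m = extend(l, r, 0)
        while x[l + m] < x[r + m]:            # suffix l is smaller: r is its NGS
            nxt[l], nlce[l] = r, m
            if m == plce[l]:
                m = extend(prv[l], r, m)
            elif m > plce[l]:
                m = plce[l]
            l = prv[l]
        prv[r], plce[r] = l, m
    # Recovery step of Corollary 1.
    lam = [0] + [nxt[i] - i + nlce[i] for i in range(1, n + 1)]
    return lam, nxt, prv, nlce


lam, nxt, prv, nlce = inverse_lyndon_array("#aababbaa$", "ab$#")
print(lam[1:])    # -> [10, 2, 1, 3, 1, 4, 3, 2, 1, 1]
print(nlce[1:])   # -> [0, 1, 0, 1, 0, 0, 0, 1, 0, 0]
```

The unique maximal sentinel at position 1 is what stops the while loop, and the unique final sentinel guarantees that every comparison finds a mismatch before running past position n; the sketch relies on both and does not validate its input.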
Proof. Correctness follows from Lemmas 7, 8, 10, 11, and 12. The additional point to justify is that the border correction does not create extra rescanning. By Lemma 8, whenever next⁻¹[ℓ] = r is fixed, the same value nlce[ℓ] = lce(ℓ, r) is both the edge LCE needed by the algorithm and the border correction needed later for λ⁻¹[ℓ]. Therefore the algorithm never computes borders explicitly and never performs a separate pass for border information.

Moreover, if Smart-Lce is called with r + m < rhs, then the query lies entirely inside a region already certified by a previous explicit scan. In that case, Lemma 12 and the stored nlce or plce values suffice to answer the query without rescanning characters inside that window. Thus borders cannot trigger hidden rescanning there. Every explicit character comparison strictly advances the global frontier rhs, which is monotone and never exceeds n. Hence the total number of explicit comparisons is O(n).

The outer loop performs n − 1 iterations. Each index receives its final next⁻¹ value once, each stored LCE value is written once, and the final recovery formula λ⁻¹[i] = next⁻¹[i] − i + nlce[i] is applied once per position. Therefore the total running time is O(n) and the space usage is O(n).

4 Experimental Evaluation

We implemented LCE-NSS for λ and LCE-NGS for λ⁻¹ in C++17. The timing runs reported here were produced by the benchmark pipeline with g++ -std=c++17 -O2. We measured wall-clock construction time and, when relevant, report the recovery pass separately. All experiments were run on a Dell Precision 7960 Tower under Ubuntu Linux, with an Intel Xeon w5-3425 CPU and 128 GB of ECC RAM. Correctness was checked against brute force on small and medium inputs and on crafted edge cases.
We used random words over alphabets of size σ ∈ {2, 4, 26}, structured synthetic families, and real texts from Pizza&Chili and the Large Canterbury Corpus [21, 22, 23]. The profiling-enabled implementation also counted explicit character comparisons, LCE reuse hits inside the certified Smart-Lce window, and explicit extension calls.

Table 1 reports the random-input timings. On instances with n ≥ 5×10⁴, the mean ratio LCE-NGS / LCE-NSS is 1.0060, the median ratio is 0.9990, and the observed range is [0.9882, 1.0939], which shows that the inverse construction matches the standard one closely across all alphabet sizes.

Table 1. Core construction time (µs) on random words.

                     σ = 2               σ = 4               σ = 26
    n           LCE-NSS   LCE-NGS   LCE-NSS   LCE-NGS   LCE-NSS   LCE-NGS
    10³            15.4      15.0      15.8      15.4      15.6      16.0
    5×10³          79.6      84.2      95.6     101.2      94.0      92.6
    10⁴           171.4     179.0     205.4     207.0     194.0     190.6
    5×10⁴         948.2    1037.2    1062.0    1059.2    1006.6    1000.6
    10⁵          1974.2    1973.4    2187.0    2162.4    2060.6    2053.8
    5×10⁵        9197.8    9416.2   10503.2   10543.2   10087.8    9981.4
    10⁶         18060.0   18320.2   20972.4   20955.0   20121.8   19919.2
    2×10⁶       35778.4   36154.0   41893.6   41840.6   40219.0   39744.4
    5×10⁶       89178.6   90033.6  102314.0  104479.2  100569.0   99436.6

Table 2 reports the results obtained on real files. Across the five corpora, the mean ratio is 1.0019, the median ratio is 1.0024, and the observed range is [0.9665, 1.0325].

Table 2. Core construction time (µs) on real corpora.

    File            n          LCE-NSS     LCE-NGS     Ratio
    english.txt     5000000     98850.6     98846.6    1.0000
    dna.txt         5000000    104074.2    104328.6    1.0024
    bible.txt       4047392     75372.2     77819.2    1.0325
    e.coli          4638690     96704.0     93460.0    0.9665
    world192.txt    2473400     46763.6     47144.6    1.0081

The structured synthetic families show the same practical behavior. Over all structured instances with n ≥ 5×10⁴, the mean ratio is 0.9852, the median ratio is 0.9981, and the observed range is [0.8848, 1.0874]. Thus the inverse kernel remains close to the standard one across random, structured, and real inputs.

To stress the border-correction path, we additionally profiled long-border families while recording explicit character comparisons, LCE reuse hits, and explicit extension calls inside Smart-Lce. Each instance contains a repeated prefix-suffix border occupying either 25% or 40% of the full length. The timing ratios on these border-heavy inputs remain close to one, with mean 1.0013, median 1.0031, and range [0.9680, 1.0257]. Table 3 shows that all counters remain linear in n. In particular, from 10⁵ to 10⁶, both explicit comparison counts and explicit extension counts grow by about 10× in both families, which matches Theorem 2. Deep borders do affect constants, but they do not trigger superlinear rescanning.

5 Conclusions and Suffix-Sorting Perspectives

We introduced the inverse Lyndon array and characterized it in terms of the next greater suffix array and a border correction term. We also showed that the combinatorial ingredients behind the NSS/PSS framework transfer to the NGS/PGS setting, yielding a linear-time construction on general ordered alphabets. The central technical point is that borders do not create hidden rescanning costs: the required correction term is already encoded by LCE values on nearest-greater-suffix edges and is absorbed by the same amortized machinery as in the standard case. Experimentally, the inverse construction preserves the same practical linear-time profile as the standard one.

A natural perspective is suffix sorting with the aid of the standard and inverse Lyndon arrays.
It is well known that the standard Lyndon array supports suffix sorting through the compatibility property

    w_λ(i) ≺ w_λ(j)  ⟹  x_i ≺ x_j,

where w_λ(i) denotes the maximal Lyndon subword starting at i [9, 15]. By contrast, the inverse case does not admit a direct analogue.

Table 3. Border-heavy profiling counters. Values are averages over three runs.

    Family       n          NSS cmp.   NGS cmp.   NSS reuse  NGS reuse  NSS ext.   NGS ext.
    25% border   100000     213984     213427     29347      29258      148890     148233
    40% border   100000     200510     208599     44719      35317      131483     141882
    25% border   500000     945829     1039872    294036     179692     582948     705698
    40% border   500000     947176     863646     292529     393730     584150     475202
    25% border   1000000    2134164    2134660    293205     293234     1481797    1483100
    40% border   1000000    2135385    2123841    292240     306666     1483741    1468012

Proposition 1 (No direct inverse compatibility). There is no direct inverse analogue of the standard compatibility property for maximal inverse Lyndon factors. In particular, it is not true in general that

    w_{λ⁻¹}(i) ≺ w_{λ⁻¹}(j)  ⟹  x_i ≺ x_j.

Proof. Take x = babacbabaa with i = 1 and j = 6. Then w_{λ⁻¹}(1) = baba and w_{λ⁻¹}(6) = babaa, so w_{λ⁻¹}(1) ≺ w_{λ⁻¹}(6), but x_1 = babacbabaa ≻ x_6 = babaa.

The obstruction is the prefix case: if an inverse Lyndon word has the form z = z′α, one may have z ≻ α while only z′ ⪰ α, so the continuation of the suffix is not determined by the maximal inverse factors alone. Nevertheless, the inverse Lyndon array still provides constant-time comparison rules for relevant suffix pairs.

Proposition 2. For positions i < j, each of the following conditions determines the order of x_i and x_j in constant time:
1. if j < i + λ[i], then x_i ≺ x_j;
2. if j = next⁻¹[i], equivalently j = i + λ⁻¹[i] − nlce[i], then x_i ≺ x_j;
3. if j < next⁻¹[i], equivalently j < i + λ⁻¹[i] − nlce[i], then x_i ≻ x_j;
4. if i = prev[j] or i = prev⁻¹[j], the comparison follows immediately from the defining edge.

Proof. Each item follows directly from the definitions.

Combining the present results with other properties of ICFL may support efficient approaches to suffix sorting, a direction that also aligns with the broader formal-language perspective discussed in [27].

Disclosure of Interests. The authors have no competing interests to declare that are relevant to the content of this article.

References

1. Lyndon, R.C.: On Burnside's problem. II. Trans. Amer. Math. Soc. 78(2), 329–332 (1955). https://doi.org/10.1090/S0002-9947-1955-0067884-8
2. Duval, J.P.: Factorizing words over an ordered alphabet. J. Algorithms 4(4), 363–381 (1983). https://doi.org/10.1016/0196-6774(83)90017-2
3. Manber, U., Myers, G.: Suffix arrays: A new method for on-line word searches. SIAM J. Comput. 22(5), 935–948 (1993). https://doi.org/10.1137/0222058
4. Manacher, G.K.: A new linear-time "on-line" algorithm for finding the smallest initial palindrome of a word. J. ACM 22(3), 346–351 (1975)
5. Alstrup, S., Gavoille, C., Kaplan, H., Rauhe, T.: Nearest common ancestors: a survey and a new distributed algorithm. In: Proc. SPAA 2002, pp. 258–264. ACM (2002). https://doi.org/10.1145/564870.564914
6. Franek, F., Islam, A.S.M.S., Rahman, M.S., Smyth, W.F.: Algorithms to compute the Lyndon array. In: Proc. Prague Stringology Conference 2016, pp. 172–184 (2016)
7. Franek, F., Liut, M.: Algorithms to compute the Lyndon array revisited. In: Proc. Prague Stringology Conference 2019, pp. 16–28 (2019)
8. Bannai, H., I, T., Inenaga, S., Nakashima, Y., Takeda, M., Tsuruta, K.: The "runs" theorem. SIAM J. Comput. 46(5), 1501–1514 (2017). https://doi.org/10.1137/15M1011032
9. Baier, U.: Linear-time suffix sorting, a new approach for suffix array construction. In: CPM 2016, LIPIcs, vol. 54, pp. 23:1–23:12. Schloss Dagstuhl (2016). https://doi.org/10.4230/LIPIcs.CPM.2016.23
10. Franek, F., Paracha, A., Smyth, W.F.: The linear equivalence of the suffix array and the partially sorted Lyndon array. In: Proc. Prague Stringology Conference 2017, pp. 77–84 (2017)
11. Bonizzoni, P., De Felice, C., Zaccagnino, R., Zizza, R.: Inverse Lyndon words and inverse Lyndon factorizations of words. Adv. Appl. Math. 101, 281–319 (2018). https://doi.org/10.1016/j.aam.2018.08.005
12. Bonizzoni, P., De Felice, C., Riccardi, B., Zaccagnino, R., Zizza, R.: Unveiling the connection between the Lyndon factorization and the canonical inverse Lyndon factorization via a border property. In: MFCS 2024, LIPIcs, vol. 306, pp. 31:1–31:14. Schloss Dagstuhl (2024). https://doi.org/10.4230/LIPIcs.MFCS.2024.31
13. Ellert, J.: Lyndon arrays simplified. In: ESA 2022, LIPIcs, vol. 244, pp. 48:1–48:14. Schloss Dagstuhl (2022). https://doi.org/10.4230/LIPIcs.ESA.2022.48
14. Bille, P., Ellert, J., Fischer, J., Gørtz, I.L., Kurpicz, F., Munro, I., Rotenberg, E.: Space efficient construction of Lyndon arrays in linear time. CoRR abs/1911.03542 (2019)
15. Bertram, N., Ellert, J., Fischer, J.: Lyndon words accelerate suffix sorting. In: ESA 2021, LIPIcs, vol. 204, pp. 15:1–15:13. Schloss Dagstuhl (2021). https://doi.org/10.4230/LIPIcs.ESA.2021.15
16. Bannai, H., Ellert, J.: Lyndon arrays in sublinear time. In: ESA 2023, LIPIcs, vol. 274, pp. 14:1–14:16. Schloss Dagstuhl (2023). https://doi.org/10.4230/LIPIcs.ESA.2023.14
17. Badkobeh, G., Crochemore, M., Ellert, J., Nicaud, C.: Back-to-front online Lyndon forest construction. In: CPM 2022, LIPIcs, vol. 223, pp. 13:1–13:23. Schloss Dagstuhl (2022). https://doi.org/10.4230/LIPIcs.CPM.2022.13
18. Bonizzoni, P., et al.: Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches. Inform. Sci. 607, 458–476 (2022). https://doi.org/10.1016/j.ins.2022.06.005
19. Bauer, M.J., Cox, A.J., Rosone, G.: Lightweight algorithms for constructing and inverting the BWT of word collections. Theoret. Comput. Sci. 483, 134–148 (2013). https://doi.org/10.1016/j.tcs.2012.02.002
20. Fischer, J., Kurpicz, F.: Dismantling DivSufSort. CoRR abs/1710.01896 (2017)
21. Ferragina, P., Navarro, G.: The Pizza&Chili Corpus. Dataset and benchmark collection for compressed indexes (2007). https://pizzachili.dcc.uchile.cl/
22. Arnold, R., Bell, T.: A corpus for the evaluation of lossless compression algorithms. In: Proceedings of the Data Compression Conference, pp. 201–210. IEEE Computer Society (1997)
23. The Canterbury Corpus: Descriptions of the corpora. University of Canterbury. https://corpus.canterbury.ac.nz/descriptions/
24. Crochemore, M., Iliopoulos, C.S., Kociumaka, T., Kundu, R., Pissis, S.P., Radoszewski, J., Rytter, W., Waleń, T.: Near-Optimal Computation of Runs over General Alphabet via Non-Crossing LCE Queries. In: String Processing and Information Retrieval (SPIRE 2016), LNCS, vol. 9954, pp. 22–34. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46049-9_3
25. Ellert, J., Fischer, J.: Linear Time Runs Over General Ordered Alphabets. In: 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021), LIPIcs, vol. 198, pp. 63:1–63:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021). https://doi.org/10.4230/LIPIcs.ICALP.2021.63
26. Bonizzoni, P., De Felice, C., Zaccagnino, R., Zizza, R.: On the longest common prefix of suffixes in an inverse Lyndon factorization and other properties. Theor. Comput. Sci. 862, 24–41 (2021). https://doi.org/10.1016/j.tcs.2020.10.034
27. Bonizzoni, P., De Felice, C., Pirola, Y., Rizzi, R., Zaccagnino, R., Zizza, R.: Can Formal Languages Help Pangenomics to Represent and Analyze Multiple Genomes? In: Diekert, V., Volkov, M.V. (eds.) Developments in Language Theory, DLT 2022. LNCS, vol. 13257, pp. 3–12. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-05578-2_1
