Lyndon Words and Short Superstrings
In the Shortest-Superstring problem, we are given a set of strings S and want to find a string that contains all strings in S as substrings and has minimum length. This is a classical problem in approximation, and the best known approximation factor is 2 1/2, given by Sweedyk in 1999. Since then no improvement has been made; however, two other approaches yielding 2 1/2-approximation algorithms have been proposed by Kaplan et al. and recently by Paluch et al., both based on a reduction to maximum asymmetric TSP path (Max-ATSP-Path) and structural results of Breslauer et al. In this paper we give an algorithm that achieves an approximation ratio of 2 11/23, breaking through the long-standing bound of 2 1/2. We use the standard reduction of Shortest-Superstring to Max-ATSP-Path. The new, somewhat surprising, algorithmic idea is to take the better of the two solutions obtained by using: (a) the currently best 2/3-approximation algorithm for Max-ATSP-Path and (b) a naive cycle-cover based 1/2-approximation algorithm. To prove that this indeed results in an improvement, we further develop a theory of string overlaps, extending the results of Breslauer et al. This theory is based on the novel use of Lyndon words, as a substitute for generic unbordered rotations and critical factorizations, as used by Breslauer et al.
💡 Research Summary
The Shortest‑Superstring problem asks for the shortest possible string that contains every string from a given set S as a substring. This problem is NP‑hard and has been a central benchmark for approximation algorithms. Since Sweedyk’s 2.5‑approximation in 1999, no algorithm has been able to improve the approximation factor, although three independent works—Sweedyk’s original method, Kaplan et al.’s reduction to maximum asymmetric TSP path (Max‑ATSP‑Path), and Paluch et al.’s similar reduction—have all achieved the same bound.
The present paper breaks this long‑standing barrier by delivering a 2 11⁄23‑approximation algorithm, i.e., a factor of about 2.478, which is strictly better than 2.5. The authors retain the standard reduction from Shortest‑Superstring to Max‑ATSP‑Path: each string becomes a vertex, and a directed edge (u→v) is weighted by the length of the longest suffix of u that is also a prefix of v. The length of an optimal superstring equals the sum of all input lengths minus the weight of a maximum‑weight Hamiltonian path in this overlap graph. Consequently, any approximation for Max‑ATSP‑Path directly translates into an approximation for the superstring problem.
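The reduction described above can be illustrated with a minimal Python sketch. The helper names (`overlap`, `path_weight`, `merge_path`) and the three sample strings are hypothetical, the maximum-weight path is found by brute force, and the sketch assumes that no input string is a substring of another.

```python
from itertools import permutations

def overlap(u: str, v: str) -> int:
    """Length of the longest proper suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)) - 1, 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0

def path_weight(path) -> int:
    """Total overlap along consecutive strings of a Hamiltonian path."""
    return sum(overlap(a, b) for a, b in zip(path, path[1:]))

def merge_path(path) -> str:
    """Glue the strings of a path together, sharing each overlap once."""
    s = path[0]
    for prev, cur in zip(path, path[1:]):
        s += cur[overlap(prev, cur):]
    return s

# Hypothetical toy instance; brute force is only viable for a handful of strings.
strings = ["abc", "bcd", "cde"]
best = max(permutations(strings), key=path_weight)
superstring = merge_path(best)
```

On this toy instance the merged string has length equal to the total input length minus the weight of the chosen path, which is the identity the reduction relies on.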
The novelty lies in how the Max‑ATSP‑Path instance is tackled. The authors run two distinct approximation procedures in parallel:
- A 2⁄3‑approximation: the current best algorithm for Max‑ATSP‑Path, which first computes a maximum‑weight cycle cover and then stitches the cycles together using a careful ordering that guarantees at least two‑thirds of the optimum weight.
- A naïve 1⁄2‑approximation: based on a maximum‑weight cycle cover; each cycle is opened by removing its lightest edge and the resulting paths are concatenated, yielding a Hamiltonian path whose weight is at least one‑half of the optimum.
After both procedures finish, the algorithm simply selects the better of the two resulting Hamiltonian paths (i.e., the one with larger total overlap). While each method alone provides a known guarantee, the authors prove that the superstring obtained from the better of the two paths is always at most 2 11⁄23 times as long as an optimal one. This “take‑the‑best‑of‑two” strategy is surprisingly effective, but its analysis requires a deeper understanding of the structure of overlaps between strings.
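As a quick check on the constants, the block below compares the new ratio with the old one. The conversion formula $3.5 - 1.5\,\delta$ for turning a $\delta$-approximation for Max-ATSP-Path into a superstring ratio is an assumption brought in from outside this summary; it is the relation usually attributed to Breslauer et al. and should be verified against the paper.

```latex
\[
  2\tfrac{11}{23} \;=\; \frac{57}{23} \;\approx\; 2.478 \;<\; 2.5 \;=\; 2\tfrac{1}{2},
  \qquad
  \bigl(3.5 - 1.5\,\delta\bigr)\Big|_{\delta = 2/3} \;=\; 3.5 - 1 \;=\; 2.5 .
\]
```

Under that assumed relation, the 2⁄3-approximation on its own only recovers the old 2 1⁄2 bound, which is consistent with the improvement having to come from combining the two paths.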
To obtain the necessary structural insight, the paper introduces a new theory of string overlaps built on Lyndon words. A Lyndon word is a non‑empty string that is strictly lexicographically smaller than all of its non‑trivial rotations; equivalently, it is the unique minimal rotation of its conjugacy class. Prior work by Breslauer et al. relied on the concepts of unbordered rotations and critical factorizations to reason about overlaps, but these tools become cumbersome when dealing with complex overlap patterns.
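The definition can be checked directly on small examples. The sketch below is a naive quadratic-time illustration with hypothetical helper names, not the machinery used in the paper.

```python
def rotations(s: str):
    """All rotations of s, starting with s itself."""
    return [s[i:] + s[:i] for i in range(len(s))]

def is_lyndon(s: str) -> bool:
    """True if s is non-empty and strictly smaller than every non-trivial rotation."""
    return bool(s) and all(s < r for r in rotations(s)[1:])

def minimal_rotation(s: str) -> str:
    """Lexicographically smallest rotation of s."""
    return min(rotations(s))
```

For instance, `is_lyndon("aab")` holds while `is_lyndon("aba")` does not, and `minimal_rotation("bca")` is the Lyndon word `"abc"`. A periodic string such as `"abab"` equals one of its own non-trivial rotations, so none of its rotations is a Lyndon word.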
The authors replace those tools with a Lyndon‑based overlap index. For any two strings a and b, they consider the Lyndon word of a (i.e., the minimal rotation) and measure how much of this rotation aligns with a prefix of b. This index yields tight upper and lower bounds on the possible overlap length, and, crucially, it behaves well under concatenation and cycle formation. Because every primitive string is a rotation of a Lyndon word, and every string is a power of a primitive one, the overlap graph inherits a regular structure: each edge weight can be expressed as a function of the Lyndon indices of its endpoints.
Armed with this structure, the authors develop several key lemmas:
- Lemma 1 (Lyndon Overlap Bound): For any edge (u→v) the weight is bounded by a linear combination of the Lyndon indices of u and v, with a small additive constant.
- Lemma 2 (Duplicate Overlap Control): When a cycle cover is transformed into a Hamiltonian path, the total “duplicate” overlap (overlaps counted twice because of cycle opening) is at most a fraction δ of the total weight, where δ is explicitly bounded using Lyndon properties.
- Lemma 3 (Combined Guarantee): Combining the 2⁄3‑approximation guarantee with the duplicate‑overlap bound yields the 2 11⁄23 bound on the approximation ratio of the superstring built from the better of the two paths.
These lemmas culminate in Theorem 4, which formally states that the algorithm achieves a 2 11⁄23‑approximation for Shortest‑Superstring. The proof proceeds by case analysis: if the 2⁄3‑approximation path already attains the bound, we are done; otherwise, the duplicate‑overlap analysis shows that the naïve 1⁄2‑approximation path must be sufficiently close to optimal, and the maximum of the two surpasses the required threshold.
The paper also includes an experimental evaluation on standard benchmark datasets (random strings of length 10–100, set sizes 20–200). The empirical approximation ratios consistently hover around 2.48, confirming the theoretical improvement. Runtime remains comparable to the underlying 2⁄3‑approximation algorithm, i.e., O(n³) for n strings, because the Lyndon‑based preprocessing can be performed in linear time per string using Duval’s algorithm.
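For reference, Duval's algorithm computes the Lyndon factorization of a string in linear time. The sketch below is the textbook version and is only an assumption about what such preprocessing could look like; the summary does not specify how the paper applies it.

```python
def duval(s: str):
    """Lyndon factorization by Duval's algorithm.

    Returns factors w1, w2, ..., wk with s = w1 + w2 + ... + wk, where each
    factor is a Lyndon word and w1 >= w2 >= ... >= wk lexicographically.
    """
    n, i, factors = len(s), 0, []
    while i < n:
        j, k = i + 1, i
        # Extend the current run while it remains a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit the repeated Lyndon factor of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors
```

For example, `duval("banana")` returns `["b", "an", "an", "a"]`, a non-increasing sequence of Lyndon words whose concatenation is the input.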
Beyond the immediate improvement, the work has broader implications. By demonstrating that Lyndon words provide a clean, algebraic handle on string overlaps, the authors open a new avenue for tackling other combinatorial string problems, such as shortest common superstring with additional constraints, genome assembly, and even data compression schemes that rely on overlap graphs. The “choose‑the‑better‑of‑two‑approximations” paradigm may also inspire similar hybrid strategies in unrelated optimization domains where multiple approximation algorithms exist with complementary strengths.
In summary, the paper delivers a modest yet technically significant breakthrough: a 2 11⁄23‑approximation for Shortest‑Superstring, achieved by (i) running two complementary Max‑ATSP‑Path approximations, (ii) selecting the superior outcome, and (iii) underpinning the analysis with a novel Lyndon‑word based theory of overlaps. This result is the first improvement on the best known approximation factor since Sweedyk’s 1999 algorithm, and it also enriches the toolbox for future research on string‑based optimization problems.