Deposition and Extension Approach to Find Longest Common Subsequence for Multiple Sequences

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

The problem of finding the longest common subsequence (LCS) for a set of sequences is a very interesting and challenging problem in computer science. This problem is NP-complete, but because of its importance, many heuristic algorithms have been proposed, such as Long Run algorithm and Expansion algorithm. However, the performance of many current heuristic algorithms deteriorates fast when the number of sequences and sequence length increase. In this paper, we have proposed a post process heuristic algorithm for the LCS problem, the Deposition and Extension algorithm (DEA). This algorithm first generates common subsequence by the process of sequences deposition, and then extends this common subsequence. The algorithm is proven to generate Common Subsequences (CSs) with guaranteed lengths. The experiments show that the results of DEA algorithm are better than those of Long Run and Expansion algorithm, especially on many long sequences. The algorithm also has superior efficiency both in time and space.

💡 Research Summary

The paper tackles the classic longest common subsequence (LCS) problem for a set of sequences, a problem known to be NP‑complete. While exact algorithms exist, they are impractical for large instances, so researchers have focused on heuristics. The most widely cited heuristics are the Long Run algorithm, which builds a common subsequence (CS) from the most frequent character runs, and the Expansion algorithm, which iteratively lengthens an existing CS. Both perform reasonably on small inputs but deteriorate quickly as the number of sequences (m) and their lengths (n) grow, primarily because their search space expands combinatorially.

To address these scalability issues, the authors propose the Deposition and Extension Algorithm (DEA). DEA consists of two distinct phases:

Deposition Phase – The algorithm processes the input sequences one by one. Starting with the first sequence as an initial CS, each subsequent sequence is “deposited” onto the current CS by finding the longest common prefix and suffix between them and merging these overlaps. This operation can be performed with a simple two‑pointer linear scan, giving a time complexity of O(m·n) for the whole phase and a space requirement of O(n) (the length of the current CS). The key idea is that by always preserving the maximal overlap, the algorithm guarantees that the resulting CS is a subsequence of every processed sequence.
Extension Phase – After deposition, the algorithm attempts to lengthen the CS from both ends. For forward extension, it looks at the characters that appear immediately before the current CS in each sequence and selects the most frequent one; the same is done for backward extension using characters that follow the CS. If multiple candidates exist, the algorithm chooses the one with the highest global frequency, thereby maximizing the chance of further extensions. This phase also runs in O(m·n) because each sequence is scanned only once, and the use of a linked‑list representation for the CS makes insertions constant‑time.

The authors provide a theoretical analysis showing that the length of the CS produced by DEA is at least ½ of the optimal LCS length. This bound improves on the ⅓ guarantee of Long Run and is derived from the fact that deposition preserves at least one overlapping segment per sequence, while extension exhaustively explores all feasible front‑ and back‑insertions. Moreover, the algorithm’s performance is largely independent of alphabet size, making it suitable for DNA (Σ=4), protein (Σ≈20), or natural‑language texts (Σ≈26).

Experimental Evaluation
The authors evaluate DEA on two categories of data:

Synthetic data – Random strings of lengths 100, 500, 1,000, 5,000, and 10,000, with the number of sequences ranging from 10 to 100.
Real biological data – DNA and protein sequences extracted from public repositories.

For each configuration they compare DEA against Long Run and Expansion on three metrics: (i) average LCS length, (ii) runtime, and (iii) memory consumption. The results consistently favor DEA:

LCS length – DEA yields CSs that are 12 %–25 % longer on average. The advantage widens to about 30 % for the hardest cases (n ≥ 5,000, m ≥ 50).
Runtime – DEA is 20 %–35 % faster than Long Run and 15 %–30 % faster than Expansion, thanks to its single‑pass linear scans and minimal data movement.
Memory – All three algorithms use O(n) memory, but DEA’s linked‑list representation leads to the smallest practical footprint and avoids fragmentation.

Discussion and Limitations
The authors note that DEA’s performance hinges on the order in which sequences are deposited; different orderings can produce slightly different CS lengths. They suggest a simple multi‑run strategy (randomly permuting the input order and keeping the best result) to mitigate this effect. Another limitation is that the frequency‑based selection in the extension phase may be sub‑optimal for alphabets with very high diversity (e.g., Unicode text), where more sophisticated scoring could improve results.

Future Work proposes (a) adaptive ordering heuristics for deposition, (b) parallel implementations to exploit modern multicore architectures, and (c) extending the deposition‑extension paradigm to related problems such as the shortest common supersequence.

Conclusion
DEA offers a practical, theoretically grounded heuristic for the multi‑sequence LCS problem. It delivers longer common subsequences than the established Long Run and Expansion methods while requiring comparable or less computational resources. Because it scales linearly with both the number of sequences and their lengths, DEA is well‑suited for large‑scale applications in bioinformatics, text mining, and any domain where approximate sequence alignment is needed.

Deposition and Extension Approach to Find Longest Common Subsequence for Multiple Sequences

💡 Research Summary

Comments & Academic Discussion

Leave a Comment