A Comparative Study on String Matching Algorithm of Biological Sequences

String matching algorithms play a vital role in computational biology. The functional and structural relationships of biological sequences are determined from the similarities among those sequences, so researchers need to be aware of these similarities. The search for similarity among biological sequences is an important research area that can bring insight into the evolutionary and genetic relationships among genes. In this paper, we study different kinds of string matching algorithms and examine their time and space complexities. We also assess the performance of these algorithms when tested on biological sequences.

💡 Research Summary

The paper presents a systematic comparative study of string‑matching algorithms as applied to biological sequences such as DNA, RNA, and proteins. Beginning with a concise motivation, the authors emphasize that detecting similarity among sequences is fundamental for uncovering evolutionary relationships, functional annotation, and genetic linkage. Consequently, the efficiency of the chosen matching algorithm becomes a critical factor when dealing with the massive datasets generated by modern high‑throughput sequencing technologies.

The authors first categorize the algorithms into three broad families: (1) classic exact‑matching methods (Naïve, Knuth‑Morris‑Pratt, Boyer‑Moore, Rabin‑Karp), (2) data‑structure‑centric approaches (Suffix Tree, Suffix Array, FM‑Index), and (3) bio‑informatics‑specific heuristics (BLAST, FASTA, Aho‑Corasick automaton for multi‑pattern search). For each method, theoretical time and space complexities are derived, where n is the text length, m the pattern length, and σ the alphabet size. The Naïve algorithm exhibits O(m·n) worst‑case performance, making it impractical for long genomic strings. KMP guarantees O(m + n) time even in the worst case. Boyer‑Moore is often sublinear in practice, but its “bad character” rule loses effectiveness when the alphabet is small (σ = 4 for nucleotides), and the basic algorithm can degrade to O(m·n) in the worst case. Rabin‑Karp’s hash‑based technique offers O(m + n) expected time but suffers from collision‑induced re‑verification overhead, which can also push it to O(m·n) in the worst case. Suffix trees support O(m + z) search time (z = number of occurrences) after an O(n) preprocessing phase, while a plain suffix array typically answers a query in O(m log n + z). Both demand O(n) memory, which can be prohibitive for whole‑genome indexes. FM‑Index and other BWT‑based compressed indexes mitigate memory consumption while keeping search costs low: counting matches takes time roughly proportional to m, with an additional, typically logarithmic, cost to locate each occurrence.
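To make the linear‑time behavior of KMP concrete, the following is a minimal Python sketch. It is an illustration only, not the authors' implementation; the function names build_failure and kmp_search are assumptions chosen for this example.

```python
def build_failure(pattern):
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text, pattern):
    """Return the start index of every occurrence (overlaps included)."""
    if not pattern:
        return []
    failure = build_failure(pattern)
    hits = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]  # reuse earlier work; text index never moves back
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = failure[k - 1]  # keep going to find overlapping matches
    return hits


print(kmp_search("ACGTACGTACG", "ACG"))
```

Because the text index only moves forward and each fallback shortens the matched prefix, the total work stays proportional to m + n regardless of the alphabet size.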

The experimental evaluation uses publicly available NCBI RefSeq genomic fragments and UniProt protein sequences ranging from 10 KB to 10 MB. All algorithms are executed on a standard workstation (16 GB RAM, 3.2 GHz CPU) with identical pattern sets (pattern lengths 10–100). Results show that the Naïve method becomes unusable beyond 1 MB, while KMP and Boyer‑Moore complete searches within 1–2 seconds but consume 45–50 MB of RAM. Rabin‑Karp performs well (≈0.8 s) when collisions are rare, yet its runtime fluctuates dramatically under high‑collision scenarios. Suffix‑tree‑based searches are the fastest (≈0.2 s) after construction, but the construction phase requires ~200 MB of memory, limiting its practicality for on‑the‑fly analyses. BLAST and FASTA, employing a seed‑and‑extend heuristic, achieve >95 % sensitivity on 10 MB datasets in 0.3 s (BLAST) and 0.45 s (FASTA) with memory footprints of roughly 200 MB and 150 MB respectively, which the summary describes as significantly more efficient than exact methods for large inputs. Aho‑Corasick excels at simultaneous multi‑pattern detection, locating 100 patterns in a 1 MB sequence within 0.5 s, but the automaton’s state table can exceed 300 MB, rendering it unsuitable for very large pattern libraries.
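The collision sensitivity reported for Rabin‑Karp can be illustrated with a small rolling‑hash sketch in Python. The summary does not show the authors' implementation, so the function name rabin_karp, the base of 256, and the default modulus are all assumptions made for this example. The function also returns how many candidate windows had to be re‑verified character by character.

```python
def rabin_karp(text, pattern, base=256, modulus=1_000_000_007):
    """Return (hits, verifications): match positions and the number of
    windows whose hash equaled the pattern hash and were re-checked."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], 0

    high = pow(base, m - 1, modulus)  # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % modulus
        t_hash = (t_hash * base + ord(text[i])) % modulus

    hits, verifications = [], 0
    for s in range(n - m + 1):
        if p_hash == t_hash:
            verifications += 1  # equal hashes: confirm with an O(m) comparison
            if text[s:s + m] == pattern:
                hits.append(s)
        if s < n - m:
            # Roll the hash: drop text[s], append text[s + m].
            t_hash = ((t_hash - ord(text[s]) * high) * base
                      + ord(text[s + m])) % modulus
    return hits, verifications


hits, checks = rabin_karp("ACGGCAACG", "ACG")
print(hits, checks)
```

Passing a deliberately tiny modulus (for example modulus=3) leaves the reported positions unchanged but forces many spurious hash matches, each costing an extra O(m) comparison. That is the mechanism behind the runtime fluctuation described above.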

The discussion synthesizes these findings into practical guidelines. For short sequences with ample memory, exact algorithms such as KMP or Boyer‑Moore are recommended due to their deterministic behavior and straightforward implementation. For large‑scale genomic databases, heuristic tools like BLAST/FASTA strike the best balance between speed, sensitivity, and resource usage. When exhaustive occurrence reporting is mandatory, suffix‑based indexes are optimal provided sufficient memory is available. Multi‑pattern tasks, such as transcription‑factor binding site scans, benefit from Aho‑Corasick, albeit with careful memory management.
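For the exhaustive‑reporting case, a suffix array gives a compact illustration of index‑based search. The Python sketch below rests on stated assumptions: it builds the array by sorting suffixes directly, which is far slower than the O(n) preprocessing discussed in the paper and suitable only for small inputs, and it answers each query with binary search in O(m log n) plus the number of occurrences. The names build_suffix_array and find_all are hypothetical.

```python
def build_suffix_array(text):
    """Naive construction for demonstration: sort suffix start positions
    by the suffix they begin. Not suitable for genome-scale input."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, suffix_array, pattern):
    """Return every start position of pattern, in increasing order."""
    m = len(pattern)
    if m == 0:
        return []

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]  # first m characters of that suffix

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in [first, lo) start with the pattern.
    return sorted(suffix_array[first:lo])


genome = "ACGTACGTACG"
index = build_suffix_array(genome)
print(find_all(genome, index, "CGT"))
```

The index is built once and reused across queries, which is why this family pays off when many patterns are searched against the same sequence and memory is not the limiting factor.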

In conclusion, the authors affirm that BLAST and FASTA remain the de‑facto standards for biological sequence matching, while future research should focus on integrating compressed indexes (FM‑Index, BWT) with GPU acceleration to further reduce latency and memory demands. They also propose exploring hybrid frameworks that combine deep‑learning similarity scoring with traditional string‑matching pipelines, anticipating that such approaches will enhance both accuracy and scalability in next‑generation sequence analysis.