Fast low-level pattern matching algorithm

📝 Abstract

This paper focuses on pattern matching in DNA sequences. It was inspired by a previously reported method that encodes both the pattern and the sequence using prime numbers. Although fast, that method is limited to rather small pattern lengths because of computing precision problems. Our approach handles large patterns because our implementation uses modular arithmetic. To obtain results quickly, the code was adapted for multithreaded and parallel implementations. The method is reduced to assembly-language-level instructions, so the final result shows significant time and memory savings compared to the reference algorithm.

📄 Content

Fast low-level pattern matching algorithm

Janja Paliska Soldo (a), Ana Sović Kržić (b)* and Damir Seršić (c)

(a) HashCode, Frana Folnegovica 1c, 10000 Zagreb, Croatia, janjapaliska@gmail.com
(b) University of Zagreb Faculty of Electrical Engineering and Computing, 10000 Zagreb, Croatia, ana.sovic.krzic@fer.hr
(c) University of Zagreb Faculty of Electrical Engineering and Computing, 10000 Zagreb, Croatia, damir.sersic@fer.hr

* Corresponding author. Tel.: +385-1-6129-883; fax: +385-1-6129-652

Keywords: DNA sequence; pattern matching; modular arithmetic; human genome; prime number coding

  1. Introduction
    The pattern matching problem arises frequently in various fields of science. In bioinformatics, it is widely used in DNA sequencing, which is an important task in genomics.
  1.1. Pattern matching problem
    The exact pattern matching problem, in general, is to find all substrings of text t that are equal to the given pattern p. Let us denote the text length as n and the pattern length as m (n > m). Both t and p are strings over a finite alphabet Σ = {a1, …, aσ}. Pattern p is said to be found at location i in t if p[j] = t[i+j-1] holds for all 1 ≤ j ≤ m [1]. Pattern matching with character classes allows p to consist of character classes, such that p[i] ⊆ Σ. Here, p is found at location i if t[i+j-1] ∈ p[j] holds for all 1 ≤ j ≤ m. For example, the pattern p = b[abc]cc[bc] occurs in the text t = ‘accabbccba’ at location 5 (substring ‘bbccb’).
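    The definition of matching with character classes can be illustrated with a minimal Python sketch. The function names `parse_pattern` and `find_class_matches` are hypothetical and chosen only for this illustration; the bracket syntax is assumed to follow the example above, and the code is not the method proposed in this paper.

```python
import re

def parse_pattern(pattern):
    """Split e.g. 'b[abc]cc[bc]' into one set of allowed characters per position."""
    # Each token is either a bracketed class or a single literal character.
    tokens = re.findall(r"\[([^\]]+)\]|(.)", pattern)
    return [set(cls) if cls else {lit} for cls, lit in tokens]

def find_class_matches(text, pattern):
    """Return all 1-based locations i where t[i+j-1] is in p[j] for every j."""
    classes = parse_pattern(pattern)
    m = len(classes)
    return [
        i + 1
        for i in range(len(text) - m + 1)
        if all(text[i + j] in classes[j] for j in range(m))
    ]

print(find_class_matches("accabbccba", "b[abc]cc[bc]"))  # [5]
```

    Locations are reported 1-based to agree with the notation used in the text.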
    Approximate pattern matching extends the problem by allowing at most k mismatches between t and p. In the previous example, setting k = 2 yields additional occurrences of p in t, each with exactly two mismatches: location i = 1 (substring ‘accab’), location i = 4 (substring ‘abbcc’) and location i = 6 (substring ‘bccba’). To find all occurrences of the pattern in a given text, the text is usually scanned with a sliding window whose size equals the pattern length m [2].
  1.2. Literature overview
    The brute-force algorithm for exact pattern matching runs in O(mn) time: in the worst case, m characters must be checked at each of the n positions of the text.
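    As a rough illustration of the brute-force scan, the sketch below slides a window of size m over the text and counts mismatches against a budget k, following the definitions in Section 1.1 (k = 0 gives exact matching). The function name `find_with_mismatches` and the representation of the pattern as a list of strings, one per position, are assumptions made for this example only.

```python
def find_with_mismatches(text, classes, k=0):
    """Brute-force scan: 1-based locations where at most k positions mismatch.

    `classes` holds one string of allowed characters per pattern position.
    """
    n, m = len(text), len(classes)
    hits = []
    for i in range(n - m + 1):          # slide a window of size m over the text
        mismatches = 0
        for j in range(m):              # up to m checks per window, O(mn) overall
            if text[i + j] not in classes[j]:
                mismatches += 1
                if mismatches > k:      # budget exceeded, abandon this window
                    break
        else:                           # inner loop finished without a break
            hits.append(i + 1)
    return hits

pattern = ["b", "abc", "c", "c", "bc"]  # b[abc]cc[bc] from Section 1.1
print(find_with_mismatches("accabbccba", pattern))        # [5]
print(find_with_mismatches("accabbccba", pattern, k=2))   # [1, 4, 5, 6]
```

    The early exit only helps on average; in the worst case every window still requires m character checks.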
    Our paper is inspired by the work of Linhart and Shamir [1]. More details are given in Section 1.3. Here, we present a brief overview of other competing pattern matching methods.
    The Boyer-Moore algorithm is an exact, single-pattern matching method. It starts with the last character of the pattern and aligns it with its first occurrence in the text. If a mismatch between other characters is detected, the pattern is shifted right so that the mismatched text character is aligned with its rightmost occurrence in the pattern. The algorithm uses the bad character rule and the good suffix rule. Its worst-case complexity is O(m+n) and its best-case complexity is O(n/m) [3]. Horspool simplified the Boyer-Moore algorithm: his variant uses only the bad character rule, applied to the rightmost character of the window, to compute the shifts [4]. Sunday’s quick-search algorithm generates a bad character shift table during the preprocessing stage. During the search, the pattern symbols can be compared in an arbitrary order [5].
    Sheik’s algorithm reduces the number of character comparisons. First, the last character of the window is compared with the last character of the pattern. If there is a match, the algorithm compares the first character of the window with the first character of the pattern. The remaining characters are compared from right to left until a mismatch or a complete match occurs. The window shift is determined by the quick-search bad character placed next to the window [6]. The time complexity is O(n/(m+1)) in the best case and O(m(n-m+1)) in the worst case. The preprocessing phase has time complexity O(m+σ) and space complexity O(σ), where σ is the alphabet size.
    Bhukya and Somayajulu create an array of indices of character pairs to reduce the number of comparisons, as well as an array of frequencies of each pair. The index table for the pattern is created in the same way. The search starts with the pair that has the lowest frequency in the text. Using the index, the possible locations of the pattern in the sequence are found. The rest of the pattern is compared sequentially [7].
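    To make the bad character rule more concrete, the following is a minimal sketch of Horspool's simplification for exact, single-pattern matching. The function name `horspool_search` is hypothetical, and the window is compared with a plain slice comparison rather than character by character from the right, which is a simplification made for brevity.

```python
def horspool_search(text, pattern):
    """Return all 1-based locations of `pattern` in `text` (Horspool-style shifts)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Bad character table: distance from the rightmost occurrence of each
    # character (last pattern position excluded) to the end of the pattern.
    shift = {ch: m - 1 - j for j, ch in enumerate(pattern[:-1])}
    hits = []
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:    # simplified whole-window comparison
            hits.append(i + 1)
        # Shift by the table entry of the window's rightmost character;
        # characters that do not occur in the pattern allow a full shift of m.
        i += shift.get(text[i + m - 1], m)
    return hits

print(horspool_search("accabbccba", "bcc"))  # [6]
```

    When the rightmost window character does not occur in the pattern, the window jumps by m positions, which is where the best-case behaviour of this family of algorithms comes from.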
    A somewhat different approach uses keyword and suffix trees. Keyword tree is built

This content is AI-processed based on ArXiv data.
