The Distribution and Deposition Algorithm for Multiple Sequences Sets


📝 Original Info

  • Title: The Distribution and Deposition Algorithm for Multiple Sequences Sets
  • ArXiv ID: 0904.1242
  • Date: 2009-04-07
  • Authors: Kang Ning, Hon Wai Leong

📝 Abstract

A sequences set is a mathematical model used in many applications. As the number of sequences grows, a single sequences set model is no longer appropriate for the rapidly increasing problem sizes. For example, more and more text processing applications split a single large text file into multiple files before processing. For these applications, the underlying mathematical model is multiple sequences sets (MSS). Although MSS are increasingly used, there is little research on how to process them efficiently. To process multiple sequences sets, sequences are first distributed to different sets, and then the sequences in each set are processed. Deriving an effective algorithm for MSS processing is both interesting and challenging. In this paper, we define cost functions and a performance ratio for analyzing the quality of synthesis sequences. Based on these, the problem of Process of Multiple Sequences Sets (PMSS) is formulated. We first propose two greedy algorithms for the PMSS problem, which generalize algorithms for a single sequences set. Then, based on an analysis of the characteristics of multiple sequences sets, we propose the Distribution and Deposition (DDA) algorithm and the DDA* algorithm for the PMSS problem. In the DDA algorithm, the sequences are first distributed to multiple sets according to their alphabet contents; the sequences in each set are then deposited by the deposition algorithm. The DDA* algorithm differs from DDA in that it distributes sequences by clustering based on sequence profiles. Experiments show that DDA and DDA* always output results with smaller costs than the other algorithms, and that DDA* outperforms DDA in most instances. The DDA and DDA* algorithms are also efficient in both time and space.

📄 Full Content

A sequences set $S$ contains one or more sequences, $S = \{s_1, s_2, \ldots, s_N\}$. Multiple sequences sets (MSS) contain one or more such sequences sets, $MSS = \{S_1, S_2, \ldots, S_M\}$. "Multiple sequences sets" is a mathematical model used in many applications, especially biological sequence analysis, text processing, and scheduling.

Example 1. Biological sequence analysis (oligo array synthesis)

The synthesis of oligo arrays is important to biological sequence analysis. Because each synthesis step is error prone and also costly in the manufacturing process, the oligos have to be deposited onto the array efficiently (i.e., short synthesis sequences are required). As the number of oligos used in microarray experiments increases, a single array can no longer hold all of them, and the oligos must be deposited onto different arrays. Therefore, a method that distributes these oligos onto different arrays and deposits them on their respective arrays so that the total synthesis cost is minimized is critical.

Example 2. Text processing (data transfer)

In data transfer, when a single channel cannot transfer a huge amount of data efficiently, multiple channels come into use. For such multiple-channel systems, both the appropriate distribution of data to the different channels and an effective process for transferring the data through each channel are important.

Example 3. Scheduling

In a flexible job shop problem [1], there are many jobs, each consisting of different steps. Usually these jobs can be performed on a single workstation. However, when the scale of the problem becomes so large that one workstation no longer suffices, the flexible job shop problem with distributed scheduling on multiple workstations becomes necessary [1], and the total cost is to be minimized.

These applications are based on the same mathematical model of multiple sequences sets. The underlying problem is the processing of multiple sequences sets: given a set of sequences and a fixed number M, how can all of the sequences be processed on these M sets efficiently? Every such application has to accomplish two tasks: distribute the sequences to different sets (the distribution process), and process the sequences in each set (the deposition process). The objective of the whole process is to minimize the total "cost" of processing all of these sequences on the multiple sets (details in "problem formulation"). We refer to these problems as the problem of Process of Multiple Sequences Sets (PMSS).
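The two-task structure described above (distribution, then deposition) can be sketched as a small pipeline with pluggable strategies. This is only an illustrative skeleton: the names `process_mss`, `round_robin`, and `concatenate` are hypothetical, the two strategies are deliberately naive placeholders rather than the DDA or DDA* algorithms, and no cost is computed because the paper's cost functions are defined in its problem formulation, which is not reproduced here.

```python
from typing import Callable, List, Tuple

SequenceSet = List[str]


def round_robin(seqs: List[str], m: int) -> List[SequenceSet]:
    """Placeholder distribution (assumption): deal sequences to the m sets in turn."""
    sets: List[SequenceSet] = [[] for _ in range(m)]
    for idx, s in enumerate(seqs):
        sets[idx % m].append(s)
    return sets


def concatenate(seq_set: SequenceSet) -> str:
    """Placeholder deposition (assumption): concatenating all sequences gives a
    valid common supersequence, although usually far from the shortest one."""
    return "".join(seq_set)


def process_mss(
    seqs: List[str],
    m: int,
    distribute: Callable[[List[str], int], List[SequenceSet]] = round_robin,
    deposit: Callable[[SequenceSet], str] = concatenate,
) -> List[Tuple[SequenceSet, str]]:
    """Distribute the sequences to m sets, then deposit each set separately."""
    sets = distribute(seqs, m)
    return [(group, deposit(group)) for group in sets]


if __name__ == "__main__":
    for group, synthesis in process_mss(["AC", "GT", "CA", "TG"], 2):
        print(group, "->", synthesis)
```

Passing the strategies as parameters keeps the two tasks separate, so a smarter distribution rule or deposition heuristic can be swapped in without changing the pipeline itself.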

It is worth noting that the size limitation of a single set is not the only reason to use multiple sequences sets. Our analysis will show that even when all sequences fit into a single set, switching to the multiple sequences sets model can significantly reduce the processing cost.

The PMSS problem is related to the SCS problem, since for each sequences set the minimum number of process steps is exactly the length of the Shortest Common Supersequence (SCS) of its sequences.

The problem of finding the Shortest Common Supersequence (SCS) of a given set of sequences is a very important problem in sequence analysis. The SCS of a set of sequences can be defined as follows. Given two sequences $s = s_1 s_2 \ldots s_m$ and $t = t_1 t_2 \ldots t_n$ over an alphabet $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_q\}$, we say that $s$ is a subsequence of $t$ (and, equivalently, $t$ is a supersequence of $s$) if there exist indices $1 \le i_1 < i_2 < \ldots < i_m \le n$ such that $s_j = t_{i_j}$ for every $1 \le j \le m$. Given a finite set of sequences $S = \{s_1, s_2, \ldots, s_k\}$, a common supersequence of $S$ is a sequence $t$ such that $t$ is a supersequence of every sequence $s_j$ ($1 \le j \le k$) in $S$. The shortest common supersequence (SCS) of $S$ is then a common supersequence of $S$ that has minimum length.
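The subsequence and common supersequence definitions above translate directly into a short check. The following is a minimal sketch; the function names `is_subsequence` and `is_common_supersequence` are chosen here for illustration and do not come from the paper.

```python
from typing import Iterable


def is_subsequence(s: str, t: str) -> bool:
    """Return True if s can be obtained from t by deleting zero or more symbols."""
    it = iter(t)
    # `ch in it` consumes the iterator until it finds ch, so successive
    # matches are forced to occur at strictly increasing positions of t.
    return all(ch in it for ch in s)


def is_common_supersequence(t: str, seqs: Iterable[str]) -> bool:
    """Return True if t is a supersequence of every sequence in seqs."""
    return all(is_subsequence(s, t) for s in seqs)


if __name__ == "__main__":
    print(is_subsequence("ACG", "AACCGG"))
    print(is_subsequence("GA", "AG"))
    print(is_common_supersequence("ACGT", ["AC", "GT", "CT"]))
```

The check only verifies that a candidate is a common supersequence; it says nothing about whether the candidate is a shortest one.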

The SCS problem has found diverse applications in many areas, including data compression [2], scheduling [3], query optimization [4], file comparison [5], and biological sequence analysis [5,6]. The exact solution of the SCS problem can be computed by dynamic programming. However, the SCS problem on an arbitrary number of sequences is NP-hard [7], so many heuristic algorithms have been proposed for it.
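For the special case of two sequences, the dynamic programming solution mentioned above fits in a few lines. The sketch below uses the textbook recurrence over suffixes; the function name `scs_two` is an illustrative choice, and the code is not taken from the paper. The same idea extends to k sequences, but the table then has one dimension per sequence, which is why the exact approach does not scale to an arbitrary number of sequences.

```python
def scs_two(s: str, t: str) -> str:
    """Return one shortest common supersequence of s and t."""
    m, n = len(s), len(t)
    # dp[i][j] = length of a shortest common supersequence of s[i:] and t[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i == m:
                dp[i][j] = n - j          # only t[j:] remains
            elif j == n:
                dp[i][j] = m - i          # only s[i:] remains
            elif s[i] == t[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = 1 + min(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the start to rebuild one optimal supersequence.
    out = []
    i = j = 0
    while i < m and j < n:
        if s[i] == t[j]:
            out.append(s[i])
            i += 1
            j += 1
        elif dp[i + 1][j] <= dp[i][j + 1]:
            out.append(s[i])
            i += 1
        else:
            out.append(t[j])
            j += 1
    out.append(s[i:])   # at most one of these two tails is non-empty
    out.append(t[j:])
    return "".join(out)


if __name__ == "__main__":
    print(scs_two("ACG", "CGT"))
```

The table has (m + 1)(n + 1) entries, so the two-sequence case runs in time and space proportional to the product of the two lengths.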

A trivial algorithm, called Alphabet [8], gives an approximation ratio of $|\Sigma|$ for the SCS problem by using the periodic supersequence $S_{ps} = (\alpha_1 \alpha_2 \ldots \alpha_{|\Sigma|})^K$, where $K$ is the maximum length of the sequences and $\Sigma = \{\alpha_1, \alpha_2, \ldots, \alpha_{|\Sigma|}\}$ is the alphabet. However, it does not perform well in practice. Other heuristic algorithms include Majority Merge [7], Tournament [9], Greedy [9], and Reduce-Expand [8]. Several heuristic algorithms have also been proposed specifically for computing the SCS of DNA sequences (with $|\Sigma| = 4$). These include the Min-Height (MH) [10] and Sum-Height (SH) [10] heuristics. (Interestingly, Majority Merge [7] and Sum-Height [10] are the same algorithm.) Recently, we also proposed look-ahead extensions of the SH heuristic [11], as well as a post-processing reduction procedure (the LAP algorithm) [11] for the SCS problem. We studied the performance of these algorithms on DNA sequences, and ana

…(Full text truncated)…
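As a rough illustration of two of the heuristics named above, the sketch below builds the periodic supersequence used by Alphabet and implements the usual description of Majority Merge (repeatedly emit the symbol that occurs most often at the front of the remaining sequences). The function names are hypothetical, and the tie-breaking rule (smallest symbol wins) is an assumption made here only to keep the output deterministic.

```python
from collections import Counter
from typing import List


def alphabet_supersequence(seqs: List[str], alphabet: str) -> str:
    """Periodic supersequence: the alphabet repeated K times,
    where K is the length of the longest input sequence."""
    k = max((len(s) for s in seqs), default=0)
    return alphabet * k


def majority_merge(seqs: List[str]) -> str:
    """Greedy heuristic: emit the most frequent front symbol, strip it
    from every sequence that starts with it, and repeat until done."""
    remaining = [s for s in seqs if s]
    out = []
    while remaining:
        counts = Counter(s[0] for s in remaining)
        # Tie-breaking by smallest symbol is an assumption for determinism.
        best = min(counts, key=lambda c: (-counts[c], c))
        out.append(best)
        remaining = [s[1:] if s[0] == best else s for s in remaining]
        remaining = [s for s in remaining if s]
    return "".join(out)


if __name__ == "__main__":
    seqs = ["ACG", "CGT", "AGT"]
    print(len(alphabet_supersequence(seqs, "ACGT")))
    print(len(majority_merge(seqs)))
```

On this small input the periodic supersequence has length 12 while the greedy result has length 4, which is consistent with the remark that Alphabet does not perform well in practice.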

Reference

This content is AI-processed based on ArXiv data.
