The Distribution and Deposition Algorithm for Multiple Sequences Sets
A sequences set is a mathematical model used in many applications. As the number of sequences becomes larger, the single sequences set model is no longer appropriate for the rapidly increasing problem sizes. For example, more and more text processing applicat…
Authors: Kang Ning, Hon Wai Leong
Kang Ning (1) and Hon Wai Leong (2)
(1) Department of Pathology, University of Michigan, Ann Arbor, MI, USA, 48109
(2) Department of Computer Science, National University of Singapore, Singapore 117590
kning@umich.edu and leonghw@comp.nus.edu.sg

Abstract

"Sequences set" is a mathematical model used in many applications such as scheduling, text processing and biological sequence analysis. As the number of sequences becomes larger, the "single" sequences set model is no longer appropriate for the rapidly increasing problem sizes. For example, more and more text processing applications separate a single big text file into multiple files before processing. For these applications, the underlying mathematical model is "multiple sequences sets" (MSS). Though there is increasing use of MSS, there is little research on how to process MSS efficiently. To process multiple sequences sets, sequences are first distributed to different sets, and the sequences in each set are then processed. Deriving effective algorithms for MSS processing is both interesting and challenging. In this paper, we formulate the problem of Process of Multiple Sequences Sets (PMSS) by first defining the cost functions and the performance ratio. Based on these, the PMSS problem is formulated as minimizing the total cost of processing. We propose two greedy algorithms for the PMSS problem, which are based on generalizations of algorithms for a single sequences set. Then, based on an analysis of the features of multiple sequences sets, we propose the Distribution and Deposition (DDA) algorithm and the DDA* algorithm for the PMSS problem. In the DDA algorithm, the sequences are first distributed to multiple sets according to their alphabet contents; the sequences in each set are then processed by a deposition algorithm.
The DDA* algorithm differs from the DDA algorithm in that it distributes sequences by clustering based on a set of sequence features (alphabet content is one of these features). Experiments show that DDA and DDA* always output results with smaller costs than the other algorithms, and that DDA* outperforms DDA in most instances. This indicates that distributing sequences to multiple sets according to sequence features, before processing the sequences in each set, is beneficial. The DDA and DDA* algorithms are also efficient in both time and space.

1. Introduction

A sequences set ($S$) contains one or more sequences, $S = \{s_1, s_2, \dots, s_N\}$. Multiple sequences sets (MSS) contain one or more such sequences sets, $MSS = \{S_1, S_2, \dots, S_M\}$. "Multiple sequences sets" is a mathematical model that is used in many applications, especially in biological sequence analysis, text processing and scheduling.

Example 1. Multiple oligos arrays

The synthesis of oligo arrays is important to biological sequence analysis. Since each synthesis step is error prone, and each step is also costly in the manufacturing process, the oligos have to be deposited onto the array effectively (i.e. short synthesis sequences are required). As the number of oligos used in microarray experiments increases, a single array cannot hold so many oligos, and the oligos have to be deposited onto different arrays. Therefore, a method that distributes these oligos onto different arrays, and deposits the oligos on their respective arrays so that the total synthesis cost is minimized, is critical.

Example 2. Text processing (data transfer)

In data transfer, when a single channel is not able to transfer a huge amount of data efficiently, multiple channels come into use.
For these multiple-channel systems, the appropriate distribution of data to different channels, and an effective process for transferring the data through each channel, are important.

Example 3. Scheduling

In a flexible job shop problem [1], there are many jobs, and each of them has different steps. Usually these jobs can be performed on a single workstation. However, as the scale of the problem becomes so large that one workstation is not suitable, the flexible job shop problem with distributed scheduling on multiple workstations is becoming necessary [1], and the total cost is to be minimized.

These applications are based on the same mathematical model of multiple sequences sets. The underlying problem is the processing of multiple sequences sets: given a set of sequences and a fixed number $M$, how to process all of the sequences on these $M$ sets efficiently. They all have to accomplish two tasks: distributing sequences to different sets (the distribution process), and processing the sequences in each set (the deposition process). The objective of the whole process is to minimize the total "cost" of processing all of these sequences on multiple sets (details in "problem formulation"). We refer to these problems as the problem of Process of Multiple Sequences Sets (PMSS).

It is worthwhile to note that the size limitation of a single set is not the only reason that multiple sequences sets come into use. It will be shown in our analysis that even in situations where all sequences can fit into a single set, switching to the multiple sequences sets model could significantly reduce the process cost.

The PMSS problem is related to the SCS problem, since for each sequences set, the minimum number of process steps is actually the length of the Shortest Common Supersequence (SCS) of these sequences.
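To make the link between process steps and the SCS concrete, the following Python sketch computes the exact SCS length for each set of a given distribution. It is an illustration only, not the deposition algorithm of this paper: the names `scs_length` and `steps_per_set` and the example sequences are hypothetical, and the search is exhaustive over position tuples, so it is practical only for a handful of short sequences. The aggregate cost function is defined in the problem formulation, so the sketch reports per-set step counts and makes no assumption about how they are combined.

```python
from functools import lru_cache


def scs_length(seqs):
    """Length of a shortest common supersequence of `seqs` (exact, small inputs only)."""
    seqs = tuple(seqs)

    @lru_cache(maxsize=None)
    def solve(pos):
        # pos[k] is how many characters of seqs[k] are already covered.
        if all(p == len(s) for p, s in zip(pos, seqs)):
            return 0
        # Only the next uncovered character of some sequence can be useful.
        candidates = {s[p] for p, s in zip(pos, seqs) if p < len(s)}
        best = None
        for c in candidates:
            # Emitting c advances every sequence whose next character is c.
            nxt = tuple(
                p + 1 if p < len(s) and s[p] == c else p
                for p, s in zip(pos, seqs)
            )
            steps = 1 + solve(nxt)
            if best is None or steps < best:
                best = steps
        return best

    return solve(tuple(0 for _ in seqs))


def steps_per_set(distribution):
    """Minimum number of process steps for each set of a given distribution."""
    return [scs_length(s) for s in distribution]


# Hypothetical distribution of four sequences onto M = 2 sets.
distribution = [["ACGT", "AGT"], ["TTA", "TA"]]
print(steps_per_set(distribution))
```

Because the state space grows with the product of the sequence lengths, this sketch is useful for checking small cases by hand rather than for processing realistic inputs.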
SCS problem for single sequences sets

The problem of finding the Shortest Common Supersequence (SCS) of a given set of sequences is a very important problem in sequence analysis. The SCS of a set of sequences can be defined as follows: Given two sequences $s = s_1 s_2 \dots s_m$ and $t = t_1 t_2 \dots t_n$ over an alphabet set $\Sigma = \{\sigma_1, \sigma_2, \dots, \sigma_q\}$, we say that $s$ is a subsequence of $t$ (and equivalently, $t$ is a supersequence of $s$) if for every $s_j$ there is $s_j = t_{i_j}$ for some $1 \le i_1 < i_2 <$ …
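The subsequence relation in this definition can be checked with a single left-to-right scan. The Python sketch below is a minimal illustration; the function names `is_subsequence` and `is_common_supersequence` are hypothetical and do not come from the paper.

```python
def is_subsequence(s, t):
    """Return True if s can be obtained from t by deleting zero or more characters."""
    remaining = iter(t)
    # `ch in remaining` consumes the iterator up to the first match, so the
    # matched positions in t are strictly increasing, as the definition requires.
    return all(ch in remaining for ch in s)


def is_common_supersequence(t, seqs):
    """Return True if t is a supersequence of every sequence in seqs."""
    return all(is_subsequence(s, t) for s in seqs)


print(is_subsequence("AGT", "ACGT"))
print(is_common_supersequence("ABA", ["AB", "BA"]))
```

A common supersequence found this way is not necessarily shortest; the check only verifies the supersequence property.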