Algorithms of self-synchronizing single-deletion-correcting codes
This study explores the self-synchronization problem in DNA coding, specifically addressing single-deletion errors without using delimiters between codewords. We aim to identify the beginning of each codeword without using delimiters, enhancing the t…
Authors: Whan-Hyuk Choi
Abstract. This study explores the self-synchronization problem in DNA coding, specifically addressing single-deletion errors without using delimiters between codewords. We aim to identify the beginning of each codeword without using delimiters, enhancing the transmission efficiency. The motivation arises from the inefficiency of adding meaningless symbols as delimiters, which decreases the information rate. In addition, the historical context in biology, specifically Francis Crick's proposal of "codes without commas" for DNA sequences, inspires this investigation. We introduce a novel approach for correcting single-deletion errors in continuous transmissions without delimiters, distinguishing the beginning and end of each codeword. This approach is based on the properties of complementary information set codes, which are used to present an algorithm for single-deletion-correcting codes with self-synchronizing capability. Accordingly, we present encoding and decoding algorithms for self-synchronizing single-deletion-correcting DNA codes with concrete examples.

Keywords: single-deletion-correcting code, DNA codes, reversible self-dual code, self-synchronizing block codes, non-binary code.

1. Introduction

Since the inception of coding theory, error-correcting codes have focused on substitution errors, which involve symbol changes [10, 17]. However, practical communication scenarios often involve a spectrum of errors beyond symbol substitutions, encompassing deletions, insertions, and erasures, collectively known as synchronization errors [15, 19]. In our previous work [3, 12], we studied deletion-error-correcting codes and their application to DNA codes. Moreover, we introduced a construction algorithm for single-deletion-correcting (SDC) DNA codes [3].
When transmitting encoded data, it is usually assumed that the sender and receiver know the length as well as the beginning and end of each codeword. The start and end of each codeword can be distinguished by the proper placement of delimiters; for example, a fixed number of zeroes is placed between consecutive codewords. Under this assumption, we proposed an algorithm for correcting single insertion/deletion errors in [3].

However, adding meaningless symbols, such as commas, to distinguish the beginning and end of a codeword reduces transmission efficiency, as the overall length of the information increases and the information rate of the code decreases. Thus, we aim to answer the following question: How can we identify the beginning of each codeword without delimiters placed between the codewords, even if they suffer from deletion errors? We call this the self-synchronization problem.

For instance, suppose that one receives a sequence of DNA codewords of length 8, separated by a delimiter, denoted by |, as follows:

GATCCTAG | AGTTACT | GGATTGTC | GGGTTGG | CGTCCTGC.

Because the delimiters' positions are known, the receiver can tell that this sequence has five codewords and that a single deletion error has occurred in each of the second and fourth codewords, because they have only seven symbols. This type of error can be corrected using the algorithm in [12]. However, with no delimiters between the codewords, the receiver receives a sequence of 38 consecutive symbols, as follows:

GATCCTAGAGTTACTGGATTGTCGGGTTGGCGTCCTGC.

In this case, at least two deletion errors can be inferred from the sequence's length, since 38 is not a multiple of 8 and the next multiple is 40. However, it is impossible to determine the positions of the deletion errors because the receiver does not know where each codeword begins or ends.
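The contrast between the two received strands can be made concrete with a few lines of Python. This is only an illustrative sketch: the variable names are hypothetical, and the last step assumes that the receiver knows five codewords of length 8 were sent.

```python
# Received strand from the example above, with "|" as the delimiter.
received = "GATCCTAG|AGTTACT|GGATTGTC|GGGTTGG|CGTCCTGC"
CODEWORD_LENGTH = 8  # length of an undamaged codeword in this example

# With delimiters: split the strand and flag every block that is one symbol short.
blocks = received.split("|")
damaged = [i for i, block in enumerate(blocks, start=1)
           if len(block) == CODEWORD_LENGTH - 1]
print(damaged)  # [2, 4]

# Without delimiters: only the total number of missing symbols can be read off,
# and only under the assumption that the number of codewords is known.
stream = "".join(blocks)
missing = len(blocks) * CODEWORD_LENGTH - len(stream)
print(len(stream), missing)  # 38 2
```

The delimited strand localizes each deletion to a block, whereas the undelimited stream reveals nothing about where the two missing symbols were.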
This study introduces a novel approach for correcting a single-deletion error, even when transmitting codewords continuously without delimiters, by distinguishing the beginning and end of each codeword. Discarding delimiters is the primary distinction between this and earlier studies on synchronization errors.

Interestingly, the first self-synchronization problem emerged in biology, shortly after the discovery of the structure of deoxyribonucleic acid (DNA), and was posed by one of its discoverers, Francis Crick [5]. Crick aimed to solve a mathematical issue concerning DNA sequences in connection with protein synthesis. In [5], Crick proposed codes without commas (equivalently, codes without delimiters), which refer to DNA codes composed of three DNA symbols encoding each amino acid; this solved a kind of synchronization problem for DNA sequences. Although Crick's solution was proved biologically wrong, his idea was developed by mathematicians interested in the synchronization problem [7, 16, 18].

In this study, we investigate the self-synchronization problem of DNA codes. We introduce encoding and decoding algorithms for SDC DNA codes with self-synchronizing capabilities. These algorithms detect the location of a single deletion error in each DNA codeword, correct the error, and distinguish the beginning and end of each DNA codeword. Even though there has been extensive research on deletion error correction, especially in the context of DNA sequences, almost all of it has assumed the presence of delimiters between codewords [8, 9, 21]. Therefore, as far as we know, this is the first study to provide explicit algorithms for self-synchronizing SDC DNA codes without using delimiters.

The remainder of this paper is organized as follows. We begin with the preliminaries in Section 2. In addition, we summarize crucial results from [3] and [12] and discuss the properties of DNA codes. Theorem 17, Algorithm 15, and Algorithm 16 are presented in Section 3.
Section 4 describes the implementation of the novel algorithm.

2. Preliminaries

2.1. Single-deletion-correcting codes. Let $\mathbb{F}_q$ be the finite field of order $q$ for a prime power $q$. A subset $C$ of $\mathbb{F}_q^n$ is called a code of length $n$ over $\mathbb{F}_q$. In particular, if $C$ is a $k$-dimensional subspace of $\mathbb{F}_q^n$, $C$ is called a $q$-ary linear code of length $n$ and dimension $k$, which we denote as an $[n, k]_q$ code. Each element of a code is called a codeword.

A code $C$ of length $n$ is called systematic if there exists a subset $I$ of $\{1, 2, \ldots, n\}$ (called an information set of $C$) such that every possible tuple of length $|I|$ occurs in exactly one codeword of $C$ within the specified coordinates $x_i$, $i \in I$ [4, 13]. If $C$ is a systematic code with an information set $I$ of size $k$, then there exists a one-to-one correspondence between $\mathbb{F}_q^k$ and the $q^k$ distinct codewords of $C$, given by restricting each codeword to the coordinates in $I$. If the first $k$ coordinates form the information set, the code has a unique generator matrix of the form $(E_k \mid A)$, where $E_k$ is the $k \times k$ identity matrix and $A$ is a $k \times (n - k)$ matrix. Such a generator matrix is said to be in standard form.

A complementary information set (CIS) code is a special type of systematic code: a CIS code $C$ of length $2n$ over $\mathbb{F}_q$ is a $[2n, n]_q$ code which has two disjoint information sets $I$ and $J$, each of size $n$. In other words, every vector in $\mathbb{F}_q^n$ appears exactly once in the coordinates of $C$ restricted to $I$, and also exactly once in the coordinates restricted to $J$. For details on CIS codes, refer to [4]. The following lemma characterizes CIS codes over $\mathbb{F}_q$.

Lemma 1 [4, Lemma 4.1]. If a $[2n, n]$ code $C$ over $\mathbb{F}_q$ has generator matrix $(E_n \mid A)$ with $A$ invertible, then $C$ is a CIS code with the systematic partition. Conversely, every CIS code is equivalent to a code with a generator matrix in that form.

Example 2.
Consider a $[4, 2]_3$ linear code $C_1$ having the generator matrix in standard form
\[
\begin{pmatrix} 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix},
\]
which means
\[
C_1 = \{0000, 1011, 0110, 1121, 2022, 0220, 2212, 1201, 2102\}.
\]
It is easy to check that all nine possible vectors in $\mathbb{F}_3^2$ appear in the first two coordinates, as well as in the last two coordinates. Since the first two and the last two coordinates form disjoint information sets of size two, $C_1$ is a CIS code.

Let $x$ be a codeword in a code $C$ of length $n$. If a vector $y \in \mathbb{F}_q^{n-1}$ is obtained from $x$ by deleting one symbol of $x$, then $y$ is called a subword of $x$. We denote a subword $y$ of $x$ as $y = x^{i}$ if the $i$-th symbol of $x$ is deleted. Let $D_1(x)$ denote the set of all subwords of $x$. A code $C$ is said to be an SDC code if $D_1(x_1) \cap D_1(x_2) = \emptyset$ for all $x_1, x_2 \in C$ with $x_1 \neq x_2$.

Let $u$ and $v$ be two codewords in a code $C$. The Levenshtein distance $d_l(u, v)$ between $u$ and $v$ is defined as the smallest number of insertions and deletions required to transform $u$ into $v$; the Levenshtein distance is a metric. The minimum Levenshtein distance of $C$, denoted by $d_l(C)$, is the smallest Levenshtein distance between distinct codewords in $C$. The Hamming distance $d_h(u, v)$ between $u$ and $v$ is defined as the number of coordinates in which $u$ and $v$ differ. The minimum Hamming distance of $C$, denoted by $d_h(C)$, is the smallest Hamming distance between distinct codewords in $C$. A code $C$ can correct $t$ substitution errors if and only if $d_h(C) > 2t$. Similarly, $C$ can correct $e$ deletion/insertion errors if and only if $d_l(C) > 2e$.

Next, we introduce some results of our previous study [3] without proof. The following theorem and remark are the main results from [3], which show that an SDC code can be created from a CIS code by inserting two identical symbols according to a specific rule.

Theorem 3.
[3, Theorem 3.5] Let $C$ be a CIS code of length $2n$ over $\mathbb{F}_q$ and let $\phi : \mathbb{F}_q^{2n} \to \mathbb{F}_q$ be the map defined by $\phi(x) = x_n + 1$, where $x = (x_i) \in \mathbb{F}_q^{2n}$. We also define the vector
\[
x^{\phi} = (x_1, \cdots, x_n, \phi(x), \phi(x), x_{n+1}, \cdots, x_{2n}) \in \mathbb{F}_q^{2n+2},
\]
obtained by inserting the symbol $\phi(x)$ twice between the $n$-th and $(n+1)$-th positions of $x$, for every codeword $x$ in $C$. Then the set of vectors $x^{\phi}$ for all codewords $x$ in $C$, that is, $C^{\phi} = \{x^{\phi} \mid x \in C\}$, is an SDC code.

Example 4. Consider the code
\[
C_1 = \{0000, 1011, 0110, 1121, 2022, 0220, 2212, 1201, 2102\}
\]
of Example 2, which consists of nine codewords. The minimum Levenshtein distance $d_l(C_1)$ is 2, since the Levenshtein distance between the codewords 1011 and 0110 is 2. By applying the map $\phi$ of Theorem 3 to $C_1$, we obtain
\[
C_1^{\phi} = \{001100, 101111, 012210, 112221, 201122, 020020, 220012, 120001, 212202\}.
\]
It is routine to check that $d_l(C_1^{\phi}) = 4$ and that $C_1^{\phi}$ is an SDC code.

Remark 5. If $\phi$ is defined by $\phi(x) = x_n + a$ for any fixed nonzero element $a$ in $\mathbb{F}_q$, then $C^{\phi}$ is an SDC code. More generally, if we define $\phi$ such that the image $\phi(x)$ is different from $x_n$ for every codeword $x$ in the CIS code $C$, then $C^{\phi}$ is an SDC code.

As Remark 5 points out, the key to constructing an SDC code from a CIS code of length $2n$ is to choose a symbol $\phi(x)$ that is different from the $n$-th symbol $x_n$ of each codeword $x$ of the CIS code and to insert it twice in the middle of the codeword. For the proof and details, refer to [3].

Algorithm 6 reproduces Algorithm 3.7 of [3]. It corrects a single-deletion error in a received vector, delimited from its neighbors, that originates from a codeword in $C^{\phi}$, where $C$ is a CIS code of length $2n$.
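The construction of Theorem 3 and the claims of Example 4 can be checked by brute force. The following Python sketch is not taken from [3]; the helper names (`span`, `insert_phi`, `deletion_ball`, `is_sdc`) are hypothetical, and the arithmetic assumes a prime field, which holds for $\mathbb{F}_3$.

```python
from itertools import combinations, product

def span(generator, q):
    """All F_q-linear combinations of the generator rows (q prime)."""
    k = len(generator)
    return [tuple(sum(a * g for a, g in zip(coeffs, column)) % q
                  for column in zip(*generator))
            for coeffs in product(range(q), repeat=k)]

def insert_phi(x, n, q):
    """x^phi of Theorem 3: insert phi(x) = x_n + 1 twice after position n."""
    phi = (x[n - 1] + 1) % q
    return x[:n] + (phi, phi) + x[n:]

def deletion_ball(x):
    """D_1(x): all words obtained from x by deleting exactly one symbol."""
    return {x[:i] + x[i + 1:] for i in range(len(x))}

def is_sdc(code):
    """A code is SDC when the deletion balls of distinct codewords are disjoint."""
    return all(deletion_ball(u).isdisjoint(deletion_ball(v))
               for u, v in combinations(code, 2))

C1 = span([(1, 0, 1, 1), (0, 1, 1, 0)], q=3)          # code of Example 2
C1_phi = [insert_phi(x, n=2, q=3) for x in C1]         # code of Example 4

print(sorted("".join(map(str, w)) for w in C1_phi))
print(is_sdc(C1), is_sdc(C1_phi))  # False True
```

The code $C_1$ itself fails the test because 1011 and 0110 share the subword 011, while the extended code passes, in line with Example 4.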
Algorithm 6. Decoding algorithm for single-deletion-error correction in [3]
Require: a received vector $x$ through a single-deletion channel from $C^{\phi}$ over $\mathbb{F}_q$
Ensure: the codeword $c$ in $C$ decoded from $x$
 1: $2n$ ← the length of $C$; $L$ ← the length of $x$
 2: if $L = 2n + 1$ then
 3:     decompose $x = u \oplus (x_{n+1}) \oplus v$, where $u, v \in \mathbb{F}_q^n$
 4:     $\phi$ ← $x_{n+1}$
 5:     $u_n$ ← the last symbol of $u$
 6:     if $u_n = x_{n+1}$ then
 7:         there is no deletion in $v$
 8:         take $y = v$ with the set $I = \{n + 1, \ldots, 2n\}$
 9:     else if $u_n \neq x_{n+1}$ then
10:         there is no deletion in $u$
11:         take $y = u$ with the set $I = \{1, \ldots, n\}$
12:     end if
13:     obtain $c$ generated by $y$ with the information set $I$
14: else if $L = 2n + 2$ then
15:     decompose $x = u \oplus (x_{n+1}, x_{n+2}) \oplus v$, where $u, v \in \mathbb{F}_q^n$
16:     $\phi$ ← $x_{n+1}$
17:     $c$ ← $u \oplus v$
18: end if
19: return $c$, $\phi$

Compared to [3], this study focuses on the self-synchronization problem: How can we identify the beginning of each codeword when there are no delimiters between codewords and the transmitted data suffer from deletion errors? Some definitions must be clarified to answer this question.

Definition 7. Let $C$ be a code and $(c_1, c_2, \cdots, c_m)$ be a sequence of $m$ codewords in $C$. If we delete all delimiters between codewords, we denote the sequence without delimiters simply by $c_1 c_2 \cdots c_m$. If there exists a proper algorithm that converts the sequence $c_1 c_2 \cdots c_m$ without delimiters back to the original sequence $(c_1, c_2, \cdots, c_m)$, then the sequence is called a self-synchronizing sequence.

Definition 8. Let $C$ be a code and $c_1 c_2 \cdots c_m$ be a sequence of $m$ codewords in $C$ without delimiters. When deletion errors in $c_1 c_2 \cdots c_m$ occur at most once in each codeword, but not consecutively, we call these errors single-deletion errors.
If there exists an appropriate algorithm that can correct single-deletion errors in a sequence $c_1 c_2 \cdots c_m$ and simultaneously convert the sequence $c_1 c_2 \cdots c_m$ without delimiters back to the original sequence $(c_1, c_2, \cdots, c_m)$, then the sequence is called a self-synchronizing single-deletion-correcting sequence.

Definition 9. If every sequence made from codewords in $C$ is, with an appropriate algorithm, a self-synchronizing SDC sequence, then the code $C$ is called a self-synchronizing SDC code.

2.2. DNA codes. Deoxyribonucleic acid (DNA) encodes the genetic information of life in the DNA helix with four basic units called nucleotides: Adenine ($A$), Cytosine ($C$), Guanine ($G$) and Thymine ($T$). The DNA helix is a double strand built by joining the four nucleotides through complementary base pairing, which connects each nucleotide to its Watson-Crick complement, denoted by $A^c = T$, $T^c = A$, $C^c = G$ and $G^c = C$. A DNA code of length $n$ is a set of tuples $(x_1, \cdots, x_n)$, where $x_i \in \{A, C, G, T\}$. A DNA codeword is an element of a DNA code. We call a continuous sequence of DNA codewords a DNA strand. The GC-weight of a DNA codeword is the number of occurrences of $C$ and $G$ in the codeword.

DNA symbols can be identified with two-digit binary numbers under the map $\delta : \mathbb{F}_2 \times \mathbb{F}_2 \to \{A, C, G, T\}$ defined by
\[
\delta(00) = A, \quad \delta(11) = T, \quad \delta(10) = C, \quad \delta(01) = G.
\]
Using this map $\delta$, we can encode a binary sequence of length $2n$ into a DNA sequence of length $n$. For example, the binary sequence 10011011011100100111 of length 20 is encoded into the DNA sequence CGCTGTACGT.

Another method is to identify DNA symbols with the four elements of $\mathbb{F}_4$ under the bijection $\mu : \mathbb{F}_4 \to \{A, C, G, T\}$ defined by
\[
\mu(0) = A, \quad \mu(1) = T, \quad \mu(\omega) = C, \quad \mu(\bar{\omega}) = G.
\]
The map $\mu$ makes the complement of the elements of $\mathbb{F}_4$ compatible with the Watson-Crick complement: the complement of $x$ in $\mathbb{F}_4$ is denoted by $x^c = x + 1$.
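The two identifications above are easy to exercise in code. The sketch below is an illustration only: it assumes that $\mathbb{F}_4$ is stored as the integers 0, 1, 2, 3 standing for $0, 1, \omega, \bar{\omega}$, so that field addition becomes bitwise XOR, and the function names are hypothetical.

```python
# delta: two bits per nucleotide, as defined above.
DELTA = {"00": "A", "11": "T", "10": "C", "01": "G"}
DELTA_INV = {symbol: bits for bits, symbol in DELTA.items()}

def bits_to_dna(bits: str) -> str:
    """Apply delta to consecutive pairs of bits."""
    if len(bits) % 2:
        raise ValueError("bit string must have even length")
    return "".join(DELTA[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bits(dna: str) -> str:
    """Inverse of bits_to_dna."""
    return "".join(DELTA_INV[symbol] for symbol in dna)

# mu with F_4 coded as 0, 1, 2 (omega), 3 (omega-bar): index i maps to MU[i].
MU = "ATCG"

def complement(dna: str) -> str:
    """Watson-Crick complement computed as x + 1 in F_4 (XOR with 1)."""
    return "".join(MU[MU.index(symbol) ^ 1] for symbol in dna)

print(bits_to_dna("10011011011100100111"))  # CGCTGTACGT
print(complement("ACGT"))                   # TGCA
```

The first call reproduces the 20-bit example from the text, and the second confirms that adding 1 in $\mathbb{F}_4$ swaps $A$ with $T$ and $C$ with $G$.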
We extend this map $\mu$ to vectors and codes. Thus, a DNA code can be identified with a code over $\mathbb{F}_4$ under the bijection $\mu$. For a vector $x = (x_1, \cdots, x_n) \in \mathbb{F}_4^n$, we denote the complement of $x$ by $x^c = (x_1^c, \cdots, x_n^c)$ and the reverse of $x$ by $x^r = (x_n, \cdots, x_1)$. The reverse-complement of $x$ is denoted by $x^{rc} = (x_n^c, \cdots, x_1^c)$. A vector $x$ is called self-reversible (self-reverse-complementary) if $x = x^r$ ($x = x^{rc}$).

For computation, we define an injective map $\tau : \mathbb{F}_4 \to M_2(\mathbb{F}_2)$, where $M_2(\mathbb{F}_2)$ is the set of $2 \times 2$ binary matrices:
\[
\tau(0) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, \quad
\tau(1) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
\tau(\omega) = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}, \quad
\tau(\bar{\omega}) = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}.
\]
Therefore, we can represent a DNA symbol by a matrix in $M_2(\mathbb{F}_2)$ using the composite map $\tau \circ \mu^{-1}$. We denote the map $\tau \circ \mu^{-1}$ and its inverse on the image by $f$ and $f^{-1}$, respectively. Next, we extend these maps to vectors, sequences, and codes. For example,
\[
f(ACCTG) = \begin{pmatrix} 0 & 0 & 0 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 & 1 & 0 & 1 & 1 & 0 \end{pmatrix}.
\]

Regarding the properties of good DNA codes, refer to [3, 6]. Given positive integers $t$ and $e$, the following five constraints are of concern when designing a DNA code $D$:
- Hamming distance constraint (HD): $d_h(x, y) \geq t$ for all $x, y \in D$ with $x \neq y$.
- Reverse constraint (RV): $d_h(x, y^r) \geq t$ for all $x, y \in D$.
- Reverse-complement constraint (RC): $d_h(x, y^{rc}) \geq t$ for all $x, y \in D$.
- Fixed GC-content constraint (GC): the GC-weights of all codewords of $D$ are constant.
- Deletion/insertion constraint (DI): $d_l(x, y) \geq e$ for all $x, y \in D$ with $x \neq y$.

In the following section, we provide a solution to the self-synchronization problem: how can we identify the beginning of each codeword, with possible errors, when no commas separate the codewords? For the solution, we start with Theorem 3 and take it one step further.
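The matrix representation $f = \tau \circ \mu^{-1}$ and the vector operations of this subsection can be sketched as follows. This is an illustrative sketch rather than the implementation used in the paper; the $2 \times 2n$ binary matrix is stored as a pair of bit strings (top row, bottom row), and all names are hypothetical.

```python
# tau(mu^{-1}(symbol)) stored as (top row, bottom row) of the 2x2 binary matrix.
TAU = {"A": ("00", "00"), "T": ("10", "01"), "C": ("01", "11"), "G": ("11", "10")}
TAU_INV = {rows: symbol for symbol, rows in TAU.items()}

def f(dna):
    """Map a DNA word of length n to the two rows of its 2 x 2n binary matrix."""
    top = "".join(TAU[s][0] for s in dna)
    bottom = "".join(TAU[s][1] for s in dna)
    return top, bottom

def f_inv(top, bottom):
    """Recover the DNA word from the two rows, reading 2x2 blocks left to right."""
    return "".join(TAU_INV[(top[i:i + 2], bottom[i:i + 2])]
                   for i in range(0, len(top), 2))

def reverse_complement(dna):
    """x^rc: reverse the word, then take the Watson-Crick complement."""
    return dna[::-1].translate(str.maketrans("ACGT", "TGCA"))

def gc_weight(dna):
    """Number of occurrences of C and G in the word."""
    return sum(s in "CG" for s in dna)

print(f("ACCTG"))                   # ('0001011011', '0011110110')
print(reverse_complement("ACCTG"))  # CAGGT
print(gc_weight("ACCTG"))           # 3
```

The first output agrees with the matrix $f(ACCTG)$ displayed above.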
Our approach involves selecting the symbol $\phi(x)$ under certain additional conditions to construct a self-synchronizing SDC code. Thus, we can distinguish the beginning and end of each codeword in an SDC code without using commas. Furthermore, we propose an algorithm to detect and correct one deletion error in each codeword.

3. Self-synchronizing SDC DNA codes

Hereafter, we focus on DNA codes. In the later parts of this paper, we assume that $C$ is a CIS code of length $2n$ over $\mathbb{F}_4$ for some integer $n$. A DNA code induced from $C$ is denoted by $D = \mu(C)$; however, we often identify $C$ and $D$.

The following theorem is a variation of Theorem 3, which is taken from [3]. We use it as a stepping-stone for constructing self-synchronizing SDC DNA codes.

Theorem 10. Let $C$ be a CIS code of length $2n$ over $\mathbb{F}_q$. Let $\psi : \mathbb{F}_q \times \mathbb{F}_q \to \mathbb{F}_q$ be a map satisfying the following conditions for all $a, b \in \mathbb{F}_q$:
i) $\psi(a, b) \neq a$ and $\psi(a, b) \neq b$;
ii) $\psi(a + 1, b + 1) \neq \psi(a, b) + 1$.
We also define a vector $x^{\psi} \in \mathbb{F}_q^{2n+2}$ as
\[
x^{\psi} = (x_1, \cdots, x_n, \psi(x_n, x_{n+1}), \psi(x_n, x_{n+1}), x_{n+1}, \cdots, x_{2n}),
\]
obtained by inserting the symbol $\psi(x_n, x_{n+1})$ twice between the $n$-th and $(n+1)$-th positions of $x$, for every codeword $x$ in $C$. Then the set of vectors $x^{\psi}$ for all codewords $x$ in $C$, that is, $C^{\psi} = \{x^{\psi} \mid x \in C\}$, is an SDC code.

Proof. By condition i) on the map $\psi$, the symbol $\psi(x_n, x_{n+1})$ is always different from $x_n$ for every codeword $x$ in $C$. Thus, the same reasoning as in the proof of Theorem 3 and Remark 5 proves the theorem. ■

The following definition provides a map $\psi$ satisfying the conditions of Theorem 10 when $q = 4$:

Definition 11. We define the map $\psi : \mathbb{F}_4 \times \mathbb{F}_4 \to \mathbb{F}_4$ by
\[
\psi(x, y) =
\begin{cases}
x + y + \omega & \text{if } x + y \in \{0, 1\} \text{ and } x, y \in \{0, 1\}, \\
x + y & \text{if } x + y \in \{0, 1\} \text{ and } x, y \in \{\omega, \bar{\omega}\}, \\
x + y & \text{if } x + y \in \{\omega, \bar{\omega}\} \text{ and } xy \neq 0, \\
1 & \text{if } x + y \in \{\omega, \bar{\omega}\} \text{ and } xy = 0.
\end{cases}
\]
We also present the values of the map $\psi$ in Table 1.

Table 1. Images of the map $\psi(x, y)$
\[
\begin{array}{c|cccc}
x \backslash y & 0 & 1 & \omega & \bar{\omega} \\ \hline
0 & \omega & \bar{\omega} & 1 & 1 \\
1 & \bar{\omega} & \omega & \bar{\omega} & \omega \\
\omega & 1 & \bar{\omega} & 0 & 1 \\
\bar{\omega} & 1 & \omega & 1 & 0
\end{array}
\]

We can easily verify that $\psi(a, b) \neq a$, $\psi(a, b) \neq b$, and $\psi(a + 1, b + 1) \neq \psi(a, b) + 1$ for each $a$ and $b$ in $\mathbb{F}_4$, which are the conditions in Theorem 10.

The following example motivates the main idea behind our approach.

Example 12. Consider a simple DNA code $D_1$ with four codewords of length six:
\[
D_1 = \{AACCAA, TATTCG, CATTGT, GAGGTC\}.
\]
This DNA code $D_1$ is a subcode of $\mu(C^{\psi})$ for the CIS code $C$ of length four over $\mathbb{F}_4$ with generator matrix
\[
\begin{pmatrix} 1 & 0 & \omega & \bar{\omega} \\ 0 & 1 & \bar{\omega} & \omega \end{pmatrix},
\]
where the map $\psi$ is from Definition 11. Thus, it is easy to verify that $D_1$ is an SDC code. Suppose some data are encoded in a DNA strand made from $D_1$:

C[A]TTGT, GAG[G]TC, CATTG[T], TA[T]TCG, [A]ACCAA.

Each codeword is separated by commas that act as delimiters. Now, assume that there are no commas and that each codeword suffers a single-deletion error, so that the bracketed symbols are deleted. Then, the DNA strand becomes

CTTGTGAGTCCATTGTATCGACCAA.

How can we decode this DNA strand to recover its original form? First, we examine the first six symbols CTTGTG and decode them to the codeword CATTGT using Algorithm 6:

CATTGT, GAGTCCATTGTATCGACCAA.

Then, we notice that the symbol G, the last symbol of CTTGTG, is the first symbol of the second possible codeword GAGTCC. Next, GAGTCC is decoded to the codeword GAGGTC in $D_1$ using Algorithm 6:

CATTGT, GAGGTC, CATTGTATCGACCAA.

Regarding the third codeword CATTGT, there is a problem: we cannot notice the deletion of its last symbol T because the fourth codeword also begins with the symbol T. If we accept the third codeword to be CATTGT, the corrected DNA strand becomes

CATTGT, GAGGTC, CATTGT, ATCGACCAA.
Consequently, because the remaining strand ATCGACCAA is considered to have three deletions in two codewords, we cannot decode the remaining sequence correctly.

From the previous example, we make the following observations.
1) Observation 1: If it is possible to decode the DNA strand codeword by codeword, from the first to the last in turn, then the whole DNA strand may be decoded.
2) Observation 2: If the last symbol of the previous codeword is identical to the first symbol of the current codeword, there may be confusion about the beginning of the current codeword.

Motivated by these observations, we propose three novel algorithms, Algorithms 13, 15 and 16, for encoding and decoding self-synchronizing SDC DNA sequences. The key to our algorithms is adding the all-one vector $\mathbf{1}$ to a DNA codeword, which replaces it by its complement, whenever its first symbol is identical to the last symbol of the previous codeword; this prevents the confusion described in Observation 2.

In the pseudocode of Algorithms 13, 15 and 16, we assume that $C$ is a CIS code of length $2n$ over $\mathbb{F}_4$ containing the all-one vector $\mathbf{1}$ as a codeword, that $c_i$, $1 \leq i \leq m$, are codewords in $C$, and that $\psi$ is a map satisfying the conditions in Theorem 10. $\Sigma[i]$ denotes the $i$-th element of $\Sigma$, and $\Sigma[i..k]$ denotes the subsequence of $\Sigma$ consisting of the $i$-th to $k$-th consecutive symbols of $\Sigma$.

Since we assume no delimiters, we regard every subsequence of length $2n + 2$ as a possible codeword. Therefore, we modify Algorithm 6 and propose Algorithm 13 for detecting and correcting a single-deletion error in a subsequence of length $2n + 2$. Using Algorithm 13 and Theorem 10, we prove the following.

Theorem 14. Let $\psi$ be a map satisfying the conditions in Theorem 10, $C$ be a CIS code of length $2n$ over $\mathbb{F}_4$ containing the all-one vector as a codeword, and $C^{\psi}$ be the set of vectors $x^{\psi}$ for all codewords $x$ in $C$.
Suppose that a sequence of $m$ codewords in $C$, $(c_1, c_2, \cdots, c_m)$, is encoded to a continuous sequence $d_1 d_2 \cdots d_m$ of vectors of length $2n + 2$ without delimiters, per the following encoding rules:
i) $d_1 = c_1^{\psi}$.
ii) For $i \geq 2$, $d_i = c_i^{\psi}$ if the last symbol of $d_{i-1}$ is not equal to the first symbol of $c_i$.
iii) For $i \geq 2$, $d_i = c_i^{\psi} + \mathbf{1}$ if the last symbol of $d_{i-1}$ is equal to the first symbol of $c_i$, where $\mathbf{1}$ is the all-one vector of length $2n + 2$.
Then, the encoded sequence $d_1 d_2 \cdots d_m$ without delimiters is a self-synchronizing SDC sequence.

Algorithm 13. Decoding algorithm for a received vector
Require: a received vector $x$ of length $2n + 1$ or $2n + 2$, taken from a sequence over $C^{\psi}$ sent through a single-deletion channel
Ensure: the codeword $c$ in $C$ decoded from $x$, the inserted symbol $\phi = x_{n+1}$, and the flag is_del
 1: $L$ ← the length of $x$
 2: if $L = 2n + 1$ then
 3:     is_del ← true
 4:     decompose $x = u \oplus (x_{n+1}) \oplus v$, where $u$ and $v$ are vectors of length $n$
 5:     $\phi$ ← $x_{n+1}$
 6:     if $x_{n+1} \neq x_{n+2}$ then
 7:         there is a single deletion in the positions $[1..n+2]$
 8:         obtain $c$ generated by $x[n+2..2n+1]$ with the information set $\{n + 1, \ldots, 2n\}$
 9:     else if $x_{n+1} = x_{n+2}$ then
10:         there is no single deletion in $u$
11:         obtain $c$ generated by $u$ with the information set $\{1, \ldots, n\}$
12:     end if
13: else if $L = 2n + 2$ then
14:     decompose $x = u \oplus (x_{n+1}, x_{n+2}) \oplus v$, where $u$ and $v$ are vectors of length $n$
15:     $\phi$ ← $x_{n+1}$
16:     if $x_{n+1} \neq x_{n+2}$ then
17:         there is a single deletion in the positions $[1..n+2]$; is_del ← true
18:         obtain $c$ generated by $x[n+2..2n+1]$ with the information set $\{n + 1, \ldots, 2n\}$
19:     else if $x_{n+1} = x_{n+2}$ then
20:         there is no single deletion in $u$
21:         obtain $c$ generated by $u$ with the information set $\{1, \ldots, n\}$
22:         if $c[n+1..2n] = x[n+3..2n+2]$ then
23:             there is no single deletion; is_del ← false
24:         else
25:             there is a single deletion in the positions $[n+3..2n+2]$; is_del ← true
26:         end if
27:     end if
28: end if
29: return $c$, $\phi$, is_del

Proof. We prove the theorem by induction on $m$, the number of codewords. When $m = 1$, any single-deletion error in the first codeword $d_1$ can be decoded using Algorithm 6. That is, a sequence consisting of a single codeword $d_1$ forms a self-synchronizing SDC sequence. Suppose, as the induction hypothesis, that for a positive integer $k$, every sequence of $k$ codewords without delimiters, $d_1 d_2 \cdots d_k$, is a self-synchronizing SDC sequence. Now consider a sequence of $k + 1$ codewords, $d_1 d_2 \cdots d_k d_{k+1}$, without delimiters. By the induction hypothesis, the first $k$ codewords $d_1 d_2 \cdots d_k$ form a self-synchronizing SDC sequence. Thus, it suffices to prove that any single-deletion error occurring in $d_{k+1}$ can be corrected.

If the last symbol of $d_k$ is not equal to the first symbol of $d_{k+1}$, there is no ambiguity in determining the starting position of $d_{k+1}$, and thus $d_{k+1}$ can be decoded using Algorithm 13. If, on the other hand, the last symbol of $d_k$ equals the first symbol of $d_{k+1}$ and the last symbol of $d_k$ is deleted, then Algorithm 13 cannot detect the single deletion in $d_k$ and will mistakenly interpret the first symbol of $d_{k+1}$ as the last symbol of $d_k$. If a single deletion also occurs within $d_{k+1}$, Algorithm 13 would then process a subsequence of $d_{k+1}$ with two deleted symbols, resulting in a decoding failure for $d_{k+1}$. However, encoding rule iii) ensures that the last symbol of $d_k$ is always different from the first symbol of $d_{k+1}$. Therefore, no confusion arises when determining the first symbol of $d_{k+1}$, and $d_{k+1}$ can be correctly decoded using Algorithm 13. This completes the induction and the proof. ■

Next, we propose Algorithms 15 and 16. Based on these algorithms, we propose a method for self-synchronizing SDC DNA codes in Theorem 17.
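Algorithms 13, 15 and 16 all rely on the map $\psi$ of Definition 11, so it may help to confirm mechanically that Table 1 agrees with the piecewise definition and with the two conditions of Theorem 10. The sketch below assumes that $\mathbb{F}_4$ is coded as the integers 0, 1, 2, 3 for $0, 1, \omega, \bar{\omega}$, so that addition is bitwise XOR; the names are hypothetical.

```python
# Table 1, with F_4 coded as 0, 1, 2 (omega), 3 (omega-bar): PSI_TABLE[x][y].
PSI_TABLE = [
    [2, 3, 1, 1],
    [3, 2, 3, 2],
    [1, 3, 0, 1],
    [1, 2, 1, 0],
]

def psi_from_definition(x, y):
    """Piecewise form of Definition 11; XOR plays the role of addition in F_4."""
    s = x ^ y
    if s in (0, 1):
        # x and y lie both in {0, 1} or both in {omega, omega-bar}.
        return s ^ 2 if x in (0, 1) else s
    # Here x + y is omega or omega-bar; xy = 0 exactly when x or y is zero.
    return s if x != 0 and y != 0 else 1

F4 = range(4)
table_matches = all(PSI_TABLE[x][y] == psi_from_definition(x, y)
                    for x in F4 for y in F4)
condition_i = all(PSI_TABLE[a][b] not in (a, b) for a in F4 for b in F4)
condition_ii = all(PSI_TABLE[a ^ 1][b ^ 1] != PSI_TABLE[a][b] ^ 1
                   for a in F4 for b in F4)
print(table_matches, condition_i, condition_ii)  # True True True
```

Condition ii) is what lets a decoder tell a complemented block from an uncomplemented one: after complementing, the inserted symbol no longer equals $\psi$ evaluated on its two neighbors.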
Algorithm 15. Encoding algorithm for self-synchronizing SDC codes
Require: a sequence of $r$ codewords $(c_1, c_2, \cdots, c_r)$
Ensure: the sequence $\Sigma = d_1 d_2 \cdots d_r$ with no delimiters
 1: $x$ ← $c_1$
 2: convert $x$ to $x^{\psi}$
 3: $\Sigma$ ← $x^{\psi}$
 4: $\lambda$ ← the last symbol of $\Sigma$
 5: for $i = 2, \ldots, r$ do
 6:     $x$ ← the $i$-th codeword $c_i$
 7:     convert $x$ to $x^{\psi}$
 8:     $\sigma$ ← the first symbol of $x^{\psi}$
 9:     if $\sigma = \lambda$ then
10:         $x^{\psi}$ ← $x^{\psi} + \mathbf{1}$
11:     end if
12:     concatenate $x^{\psi}$ to $\Sigma$ without a delimiter
13:     $\lambda$ ← the last symbol of $\Sigma$
14: end for
15: return $\Sigma$

Algorithm 16. Decoding algorithm for self-synchronizing SDC codes
Require: a sequence $\Sigma = d_1 d_2 \cdots d_r$ with no delimiters and possible single-deletion errors, encoded using Algorithm 15
Ensure: the sequence $\Lambda = (c_1, c_2, \cdots, c_r)$
 1: $\Lambda$ ← the empty sequence
 2: $m$ ← the number of symbols in the sequence $\Sigma$
 3: while $m > 0$ do
 4:     if $m \geq 2n + 2$ then
 5:         $d$ ← $\Sigma[1..(2n + 2)]$
 6:     else if $m = 2n + 1$ then
 7:         $d$ ← $\Sigma[1..(2n + 1)]$
 8:     else
 9:         $\Sigma$ is undecodable; terminate
10:     end if
11:     apply Algorithm 13 to $d$ and obtain $c$, $\phi$, and is_del
12:     if $d$ is found to have a single deletion (is_del is true) then
13:         $\Sigma$ ← $\Sigma[(2n + 2)..m]$
14:         $m$ ← $m - (2n + 1)$
15:     else
16:         $\Sigma$ ← $\Sigma[(2n + 3)..m]$
17:         $m$ ← $m - (2n + 2)$
18:     end if
19:     $a$ ← $c[n]$; $b$ ← $c[n + 1]$
20:     if $\psi(a, b) \neq \phi$ then
21:         $c$ ← $c + \mathbf{1}$
22:     end if
23:     append $c$ to $\Lambda$
24: end while
25: return $\Lambda$

Theorem 17. Let $\psi$ be the map defined in Definition 11. Assume that $C$ is a CIS code of length $2n$ over $\mathbb{F}_4$ containing the all-one vector as a codeword, and let $C^{\psi}$ be the set of vectors $x^{\psi}$ for all codewords $x$ in $C$. Then, $D = \mu(C^{\psi})$ is a self-synchronizing single-deletion-correcting DNA code, and its encoding and decoding can be achieved using Algorithms 15 and 16.

Proof. This result follows directly from Theorem 14. ■

4. Implementation on DNA codes

This section proposes encoding and decoding algorithms for binary data using a self-synchronizing SDC DNA code. We implement these algorithms in Python.
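The following Python sketch illustrates how Algorithms 13, 15 and 16 fit together. It is not the implementation referred to above but a minimal sketch under stated assumptions: $\mathbb{F}_4$ is coded as the integers 0, 1, 2, 3 for $0, 1, \omega, \bar{\omega}$; the generator matrix `G` is the one of Example 20; and the step "obtain $c$ generated by ... with the information set" is replaced by brute-force lookup tables over all 64 codewords, which is feasible only for small $n$. All function names are hypothetical.

```python
from itertools import product

SYM = "ATCG"  # mu: 0 -> A, 1 -> T, omega (2) -> C, omega-bar (3) -> G
MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]  # F_4 products
PSI = [[2, 3, 1, 1], [3, 2, 3, 2], [1, 3, 0, 1], [1, 2, 1, 0]]  # Table 1
G = [[1, 0, 0, 2, 1, 2], [0, 1, 0, 3, 3, 1], [0, 0, 1, 0, 3, 2]]  # Example 20
N = 3  # the CIS code has length 2N

def cis_codeword(msg):
    """Multiply a message in F_4^N by G; addition in F_4 is XOR."""
    word = [0] * (2 * N)
    for m, row in zip(msg, G):
        for j, g in enumerate(row):
            word[j] ^= MUL[m][g]
    return "".join(SYM[s] for s in word)

CODE = [cis_codeword(m) for m in product(range(4), repeat=N)]
BY_HEAD = {c[:N]: c for c in CODE}  # lookup by information set {1..n}
BY_TAIL = {c[N:]: c for c in CODE}  # lookup by information set {n+1..2n}

def comp(word):
    """Adding the all-one vector, i.e. the Watson-Crick complement."""
    return word.translate(str.maketrans("ATCG", "TAGC"))

def psi(a, b):
    return SYM[PSI[SYM.index(a)][SYM.index(b)]]

def encode(codewords):
    """Algorithm 15: insert psi twice; complement on a boundary clash."""
    out, last = [], None
    for c in codewords:
        p = psi(c[N - 1], c[N])
        d = c[:N] + p + p + c[N:]
        if d[0] == last:
            d = comp(d)
        out.append(d)
        last = d[-1]
    return "".join(out)

def decode_window(x):
    """Algorithm 13 on a window of length 2N+1 or 2N+2."""
    phi = x[N]
    if x[N] != x[N + 1]:  # deletion among the first N+2 symbols
        return BY_TAIL[x[N + 1:2 * N + 1]], phi, True
    c = BY_HEAD[x[:N]]    # first N symbols are intact
    is_del = len(x) == 2 * N + 1 or c[N:] != x[N + 2:2 * N + 2]
    return c, phi, is_del

def decode(stream):
    """Algorithm 16: peel off one window at a time."""
    out = []
    while stream:
        if len(stream) < 2 * N + 1:
            raise ValueError("undecodable remainder")
        c, phi, is_del = decode_window(stream[:2 * N + 2])
        stream = stream[2 * N + 1:] if is_del else stream[2 * N + 2:]
        if psi(c[N - 1], c[N]) != phi:  # the block was complemented
            c = comp(c)
        out.append(c)
    return out

words = ["GATTAG", "AAGACT", "GGAGTC", "CCCCCC", "CGTTGC"]
print(encode(words))  # GATCCTAGAAGTTACTGGATTGTCGGGTTGGGCGTCCTGC
print(decode("GATCCTAGAGTTACTGGATGTCGGGTTGGCTCCTGC") == words)  # True
```

The first call reproduces the strand $\Sigma$ of Example 20, and the second recovers the five codewords from the damaged strand of Example 21. The lookup tables stand in for the generator-matrix computations of Section 4 and would have to be replaced by linear algebra over $\mathbb{F}_4$ for larger codes.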
We assume that $C$ is a CIS code of length $2n$ over $\mathbb{F}_4$ containing the all-one vector, that $\psi$ is the map defined in Definition 11, and that $n_h$ is a sufficiently large fixed positive integer such that the length of bin_data is less than $2^{n_h}$.

Algorithm 18 (Encoding algorithm for self-synchronizing SDC DNA codes).
Input: bin_data := binary data, such as a .txt file, an image file, etc.
Output: $\Sigma$ := the encoded DNA sequence of bin_data

Step 1. [Convert bin_data to bin_seq]
Add a header as metadata for bin_data, and zero-padding, so that the length of the converted binary sequence, bin_seq, is a multiple of $2n$.
- (Length check) Let $\ell$ be the length of bin_data.
- (Header) Convert $\ell$ to the binary number $\ell_2$ of $n_h$ digits.
- (Zero-padding) Let $\ell_p$ be the all-zero sequence of length $(2n - (\ell + n_h)) \bmod 2n$.
- Let bin_seq be the concatenated sequence of $\ell_2$, $\ell_p$, and bin_data, in this order.

Step 2. [Convert bin_seq to pre-dna_seq]
Convert every two consecutive symbols of bin_seq to a DNA symbol $A$, $C$, $G$ or $T$ under the map $\delta$ to obtain pre-dna_seq:
pre-dna_seq = $\delta$(bin_seq).
The length of pre-dna_seq is a multiple of $n$.

Step 3. [Divide pre-dna_seq]
Divide pre-dna_seq into blocks of $n$ symbols and let $r$ be the number of blocks. Then, pre-dna_seq is of the form $(m_1, m_2, \cdots, m_r)$, where $m_i$ is a sequence of $n$ DNA symbols for $1 \leq i \leq r$.

Step 4. [Encode each block of pre-dna_seq]
Encode each $m_i$ of pre-dna_seq to a DNA codeword of $C$ as follows.
- Convert each $m_i$ to a $2 \times 2n$ matrix $f(m_i)$ over $\mathbb{F}_2$ by the map $f$.
- Encode each $f(m_i)$ to a codeword $f(c_i)$ by multiplying it with the image of the generator matrix of $C$ under the map $\tau$.
- Apply the map $f^{-1}$ to $f(c_i)$ and obtain the DNA codeword $c_i$ of $C$.
Thus, we obtain the sequence of DNA codewords $(c_1, c_2, \cdots, c_r)$.

Step 5. [Encoding to a self-synchronizing SDC sequence]
Apply Algorithm 15 to the sequence of DNA codewords $(c_1, c_2, \cdots, c_r)$.
Finally, we obtain the DNA sequence $\Sigma = d_1 d_2 \cdots d_r$ with no delimiters.

Algorithm 19 (Decoding algorithm).
Input: a DNA sequence $\Sigma = d_1 d_2 \cdots d_r$ with possible single-deletion errors and no delimiters.
Output: bin_data := the original binary data.

Step 1. [Decode $\Sigma$]
Apply Algorithm 16 to the DNA sequence $\Sigma = d_1 d_2 \cdots d_r$ with no delimiters. Then we obtain the sequence of DNA codewords $\Lambda = (c_1, c_2, \cdots, c_r)$.

Step 2. [Obtain pre-dna_seq]
For each DNA codeword $c_i$ in $\Lambda$, let $m_i$ be $c_i[1..n]$. Concatenating $m_i$ for all $i$ gives pre-dna_seq.

Step 3. [Convert pre-dna_seq to bin_seq]
Obtain bin_seq using the inverse map of $\delta$, that is, bin_seq = $\delta^{-1}$(pre-dna_seq).

Step 4. [Obtain the original binary data]
The binary number formed by the first $n_h$ digits of bin_seq is the length of the original binary data. Take that many digits of bin_seq, counting from the end of bin_seq, to obtain the original binary data bin_data.

We provide the following examples, which illustrate Algorithms 18 and 19. In these examples, we use a reversible CIS $[6, 3, 3]$ code in the encoding and decoding process. We exploit the reversibility of codewords when decoding codewords with a single error. For details on reversible codes, please refer to [3, 11].

Example 20. Let $C$ be the reversible self-dual code of length 6 over $\mathbb{F}_4$ with generator matrix
\[
G = \begin{pmatrix}
1 & 0 & 0 & \omega & 1 & \omega \\
0 & 1 & 0 & \omega^2 & \omega^2 & 1 \\
0 & 0 & 1 & 0 & \omega^2 & \omega
\end{pmatrix},
\]
and set the header length to $n_h = 6$. It is easy to verify that $C$ contains the all-one vector as a codeword. The conversion of the generator matrix to DNA symbols is
\[
\mu(G) = \begin{pmatrix}
T & A & A & C & T & C \\
A & T & A & G & G & T \\
A & A & T & A & G & C
\end{pmatrix},
\]
and the conversion of the generator matrix over $\mathrm{GF}(2)$ is
\[
\tau(G) = \begin{pmatrix}
100000011001 \\
010000110111 \\
001000111110 \\
000100101001 \\
000010001101 \\
000001001011
\end{pmatrix}.
\]
Assume that we have binary data of 19 bits:
bin_data = 1010100101010100111.

Step 1.
[Convert bin_data to bin_seq]
Since we set the header length $n_h = 6$, the header becomes 010011, the six-digit binary representation of the length 19. Thus, we need five 0's for zero-padding so that the concatenated binary sequence has a length of 30, a multiple of 6:
bin_seq = 010011 / 00000 / 1010100101010100111.
Note that there are $r = 5$ blocks in this sequence.
Step 2. [Convert bin_seq to pre-dna_seq]
The concatenated data is converted to the pre-dna_seq of length $30/2 = 15$:
pre-dna_seq = $\delta$(bin_seq) = GATAAGGGACCCCGT.
Step 3. [Divide pre-dna_seq]
The sequence GATAAGGGACCCCGT is divided into five subsequences of 3 DNA symbols:
GAT, AAG, GGA, CCC, CGT.
Step 4. [Encode each block of pre-dna_seq]
Each subsequence of pre-dna_seq is converted to a $2 \times 6$ matrix using the map $f$:
$$\begin{pmatrix} 110010 \\ 100001 \end{pmatrix},\ \begin{pmatrix} 000011 \\ 000010 \end{pmatrix},\ \begin{pmatrix} 111100 \\ 101000 \end{pmatrix},\ \begin{pmatrix} 010101 \\ 111111 \end{pmatrix},\ \begin{pmatrix} 011110 \\ 111001 \end{pmatrix}.$$
By multiplying each block by the generator matrix $\tau(G)$, we obtain five codewords in the form of $2 \times 12$ matrices. For example, the first block is encoded as
$$\begin{pmatrix} 110010 \\ 100001 \end{pmatrix} \begin{pmatrix} 100000011001 \\ 010000110111 \\ 001000111110 \\ 000100101001 \\ 000010001101 \\ 000001001011 \end{pmatrix} = \begin{pmatrix} 110010100011 \\ 100001010010 \end{pmatrix}.$$
Applying $f^{-1}$ gives
$$f^{-1}\begin{pmatrix} 110010100011 \\ 100001010010 \end{pmatrix} = \text{GATTAG}.$$
Repeating this process for all the blocks, we obtain the sequence of DNA codewords
GATTAG, AAGACT, GGAGTC, CCCCCC, CGTTGC.
Step 5. [Encoding to self-synchronizing SDC sequence]
Applying Algorithm 15, the sequence of DNA codewords is converted into
GATCCTAG, AAGTTACT, GGATTGTC, GGGTTGGG, CGTCCTGC.
The fourth block CCCCCC is converted into CCCAACCC at first; however, its first symbol C is identical to the last symbol of the previous codeword GGATTGTC. Thus, we take GGGTTGGG, the complement of CCCAACCC. Therefore, as the self-synchronizing SDC DNA
Therefore, as the self-synchronizing SDC DNA 16 sequence, we obtain Σ = GAT C C T AGAAGT T AC T GGA T T GT C GGGT T GGGC GT C C T GC. Example 21. Assume that the sender sends the original DN A sequence from Ex- ample 20: Σ = GAT C C T AGAAGT T AC T GGAT T GT C GGGT T GGGC GT C C T GC , and assume that we hav e the information of the code C and n h = 6 . Suppose that four single-deletion errors occur during transmission as follo ws: Σ = GAT C C T AGA @ @ AGT T AC T GGA @ T T GT C GGGT T G @ @ GGC @ @ GT C C T GC . Thus, we recei ve the follo wing DN A sequence: Σ ′ = GAT C C T AGAGT T AC T GGAT GT C GGGT T GGC T C C T GC Step 1. [Decode Σ ′ ] Apply Algorithm 16 to the DN A sequence Σ ′ . - Set Λ = () , empty sequence and m = 36 , the number of symbols in Σ ′ - Since m = 36 > 0 , the first iteration begins. (1) m > 8 ; thus, we set d = GAT C C T AG . (2) Apply Algorithm 13 on d : - d = GAT C C T AG is decomposed into GAT , C C , T AG and set ϕ = C . - Since x n +1 = C = x n +2 , there is no deletion in position d [1 .. 3] = GAT . - f ( GAT ) = 110010 100001 , and multiplying τ ( G ) , we obtain f ( c ) = 110010100011 100001010010 ; therefore, c = GAT T AG. (3) Since c [4 .. 6] = d [6 .. 8] , we conclude that there is no deletion. Thus, m = m − 8 = 28 and Σ ′ becomes Σ ′ = AGT T AC T GGAT GT C GGGT T GGC T C C T GC . (4) Since ψ ( a, b ) = ψ ( T , T ) = C = ϕ , we append c to Λ , that is, Λ = ( GAT T AG ) and the first iteration ends. - Since m = 28 > 0 , the second iteration begins. (1) m > 8 ; thus, we set d = AGT T AC T G . (2) Apply Algorithm 13 on d : - d = AGT T AC T G is decomposed into AGT , T A, C T G , and set ϕ = T . - Since x n +1 = T = A = x n +2 , there is a single deletion in position [1..5]. 17 - W e take f ( d [5 .. 7]) = f ( AC T ) with information set [4..6]. T o obtain c , we use f ( T C A r ) = f ( T C A ) = 100100 011100 . 
      By multiplying by $\tau(G)$, we obtain
      $$\begin{pmatrix} 100100 \\ 011100 \end{pmatrix} \begin{pmatrix} 100000011001 \\ 010000110111 \\ 001000111110 \\ 000100101001 \\ 000010001101 \\ 000001001011 \end{pmatrix} = \begin{pmatrix} 100100110000 \\ 011100100000 \end{pmatrix}.$$
      Thus, we see that TCAGAA is a codeword of $C$, and the reversibility of $C$ ensures that AAGACT is also a codeword of $C$ having ACT in information set [4..6]. Therefore, we conclude that $c$ = AAGACT.
  (3) Since $d$ has a single deletion, $m = m - 7 = 21$ and $\Sigma'$ becomes
      $\Sigma'$ = GGATGTCGGGTTGGCTCCTGC,
      returning the last symbol G of $d$.
  (4) Since $\psi(a, b) = \psi(G, A) = T = \phi$, we append $c$ to $\Lambda$, and the second iteration ends.
- Since $m = 21 > 0$, the third iteration begins.
  (1) $m > 8$; thus, we set $d$ = GGATGTCG.
  (2) Apply Algorithm 13 on $d$:
    - $d$ = GGATGTCG is decomposed into GGA, TG, TCG, and we set $\phi = T$.
    - Since $x_{n+1} = T \neq G = x_{n+2}$, there is a single deletion in position [1..5].
    - We take $f(d[5..7]) = f(\text{GTC})$ with information set [4..6].
    - Following the same process as in the second iteration, we obtain $c$ = GGAGTC.
  (3) Since $d$ has a single deletion, $m = m - 7 = 14$ and $\Sigma'$ becomes
      $\Sigma'$ = GGGTTGGCTCCTGC,
      returning the last symbol G of $d$.
  (4) Since $\psi(a, b) = \psi(A, G) = T = \phi$, we append $c$ to $\Lambda$, and the third iteration ends.
- Since $m = 14 > 0$, the fourth iteration begins.
  (1) $m > 8$; thus, we set $d$ = GGGTTGGC.
  (2) Apply Algorithm 13 on $d$:
    - $d$ = GGGTTGGC is decomposed into GGG, TT, GGC, and we set $\phi = T$.
    - Since $x_{n+1} = T = x_{n+2}$, there is no single deletion in position [1..5].
    - We take $f(d[1..3]) = f(\text{GGG})$ with information set [1..3].
    - Following the same process as in the first iteration, we obtain $c$ = GGGGGG.
  (3) Since $c[4..6] = \text{GGG} \neq \text{GGC} = d[6..8]$, $d$ has a single deletion. Thus, $m = m - 7 = 7$ and $\Sigma'$ becomes
      $\Sigma'$ = CTCCTGC,
      returning the last symbol C of $d$.
  (4) Since $\psi(G, G) = A \neq T = \phi$, we append $c + \mathbf{1}$ = CCCCCC to $\Lambda$, and the fourth iteration ends.
- Since $m = 7 > 0$, the fifth iteration begins.
  (1) $m = 7 = 2n + 1$; thus, we set $d$ = CTCCTGC.
  (2) Apply Algorithm 13 on $d$:
    - $d$ = CTCCTGC is decomposed into CTC, C, TGC, and we set $\phi = C$.
    - Since $x_{n+1} = C \neq T = x_{n+2}$, there is a single deletion in position [1..5].
    - We take $f(d[5..7]) = f(\text{TGC})$ with information set [4..6].
    - Following the same process as in the second iteration, we obtain $c$ = CGTTGC.
  (3) Since $d$ has a single deletion, $m = m - 7 = 0$ and $\Sigma'$ becomes empty.
  (4) Since $\psi(a, b) = \psi(T, T) = C = \phi$, we append $c$ to $\Lambda$, and the fifth iteration ends.
- Since $m = 0$, the algorithm terminates.
- It returns $\Lambda$ = (GATTAG, AAGACT, GGAGTC, CCCCCC, CGTTGC).
Step 2. [Obtain pre-dna_seq]
From $\Lambda$ = (GATTAG, AAGACT, GGAGTC, CCCCCC, CGTTGC), we obtain the pre-dna_seq:
GATAAGGGACCCCGT.
Step 3. [Convert pre-dna_seq to bin_seq]
The map $\delta^{-1}$(GATAAGGGACCCCGT) gives
bin_seq = 010011000001010100101010100111.
Step 4. [Obtain the original binary data]
Since $n_h = 6$, the length of the original binary data is the binary number 010011, that is, 19. Therefore, we take 19 digits of bin_seq from the end, and the original binary data bin_data is
1010100101010100111.

5. CONCLUDING REMARKS

This study introduces a novel approach, with a self-synchronizing capability, for correcting single-deletion errors in continuous transmissions without delimiters. Whereas traditional error-correcting codes concentrate only on substitution errors, applications of coding theory in bioinformatics encompass a wider range of errors, including deletions, insertions, and erasures, known as synchronization errors. The historical context of the self-synchronization problem in biology, particularly in DNA sequences, has been explored since the discovery of DNA.
We point out that Francis Crick's early proposal of "codes without commas" for DNA sequences, though proven biologically incorrect, inspired mathematicians interested in synchronization problems.

While this work provides a theoretical foundation for the construction of single-deletion-correcting DNA codes using CIS codes, we acknowledge that our focus has been on mathematical formulation and analysis rather than immediate practical implementation. As a result, certain challenges remain regarding the direct application of our approach to real-world DNA storage and DNA computing systems. Addressing these practical aspects, including experimental validation and adaptation to the constraints of DNA synthesis, sequencing, and channel noise, will be an important direction for future research. We hope that our results will serve as a stepping stone for both theoretical advances and future applications in the field.

We hope to find applications of self-synchronizing single-deletion-correcting codes in various research fields, such as DNA computing and computer designs, DNA data-storage devices, and synthetic DNA sequence designs. In future work, we will explore the applications of DNA coding theory and continue to study self-synchronizing codes with multi-deletion or insertion-correcting capabilities.

ACKNOWLEDGMENT

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (2019R1I1A1A01057755, 2022R1C1C2011689).

REFERENCES

[1] K.A. Abdel-Ghaffar, H.C. Ferreira, and L. Cheng, "Correcting deletions using linear and cyclic codes," IEEE Trans. Inform. Theory, vol. 56, no. 10, pp. 5223–5234, 2010.
[2] J. Cannon and C. Playoust, "An Introduction to Magma," University of Sydney, Sydney, Australia, 1994.
[3] W.-H. Choi, H.J. Kim, and Y. Lee, "Construction of single-deletion-correcting DNA codes using CIS codes," Des. Codes Cryptogr., vol. 88, pp. 2581–2596, 2020.
[4] C. Carlet, P. Gaborit, J.-L. Kim, and P. Solé, "A new class of codes for Boolean masking of cryptographic computations," IEEE Trans. Inform. Theory, vol. 58, pp. 6000–6011, 2012.
[5] F.H.C. Crick, J.S. Griffith, and L.E. Orgel, "Codes without commas," Proc. Natl. Acad. Sci. U.S.A., vol. 43, no. 5, pp. 416–421, 1957.
[6] P. Gaborit and O.D. King, "Linear constructions for DNA codes," Theoret. Comput. Sci., vol. 334, no. 1–3, pp. 99–113, 2005.
[7] S.W. Golomb, B. Gordon, and L.R. Welch, "Comma-free codes," Can. J. Math., vol. 10, pp. 202–209, 1958.
[8] S.K. Hanna, "Effective IDS Error Correction Algorithms for DNA Storage Channels With Multiple Output Sequences," IEEE Trans. Inf. Theory, vol. 69, no. 9, pp. 5687–5700, 2023.
[9] B. Haeupler and A. Shahrasbi, "Synchronization Strings and Codes for Insertions and Deletions - A Survey," IEEE Trans. Inf. Theory, vol. 67, no. 6, pp. 3190–3206, 2021.
[10] W.C. Huffman and V. Pless, Fundamentals of Error-Correcting Codes, Cambridge University Press, Cambridge, 2003.
[11] H.J. Kim, W.-H. Choi, and Y. Lee, "Construction of reversible self-dual codes," Finite Fields Appl., vol. 67, pp. 101714, 2020.
[12] H.J. Kim, W.-H. Choi, and Y. Lee, "Designing DNA codes from reversible self-dual codes over GF(4)," Discrete Math., vol. 344, no. 1, pp. 112159, 2021.
[13] H.J. Kim and Y. Lee, "Complementary information set codes over GF(p)," Des. Codes Cryptogr., vol. 81, pp. 541–555, 2016.
[14] O.D. King, "Bounds for DNA codes with constant GC-content," Electron. J. Combin., vol. 10, R33, 2003.
[15] V.I. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, 1966.
[16] J. Levy, "Self-synchronizing codes derived from binary cyclic codes," IEEE Trans. Inf. Theory, vol. 12, no. 3, pp. 286–290, 1966.
[17] F.J. MacWilliams and N.J.A.
Sloane, The Theory of Error-Correcting Codes, North-Holland, Amsterdam, 1977.
[18] J.L. Massey, "Optimum Frame Synchronization," IEEE Trans. Commun., vol. 20, no. 2, pp. 115–119, 1972.
[19] F. Sala, R. Gabrys, C. Schoeny, and L. Dolecek, "Exact Reconstruction From Insertions in Synchronization Codes," IEEE Trans. Inf. Theory, vol. 63, no. 4, pp. 2428–2445, 2017.
[20] G. Van Rossum and F.L. Drake, Python 3 Reference Manual, CreateSpace, Scotts Valley, CA, 2009.
[21] Z. Yan, C. Liang, and H. Wu, "A Segmented-Edit Error-Correcting Code With Re-Synchronization Function for DNA-Based Storage Systems," IEEE Trans. Emerg. Top. Comput., vol. 11, no. 3, pp. 605–618, 2023.

Department of Mathematics, Kangwon National University, Chuncheon 24341, Korea
Kangwon Research Institute of Mathematical Sciences, Kangwon National University, Chuncheon 24341, Korea
E-mail: whchoi@kangwon.ac.kr