On the Palindromic/Reverse-Complement Duplication Correcting Codes
Motivated by applications in in-vivo DNA storage, we study codes for correcting duplications. A reverse-complement duplication of length $k$ is the insertion of the reversed and complemented copy of a substring of length $k$ adjacent to its original position, while a palindromic duplication only inserts the reversed copy without complementation. We first construct an explicit code with a single redundant symbol capable of correcting an arbitrary number of reverse-complement duplications (respectively, palindromic duplications), provided that all duplications have length $k \ge 3\lceil \log_q n \rceil$ and are disjoint. Next, we derive a Gilbert-Varshamov bound for codes that can correct a reverse-complement duplication (respectively, palindromic duplication) of arbitrary length, showing that the optimal redundancy is upper bounded by $2\log_q n + \log_q\log_q n + O(1)$. Finally, for $q \ge 4$, we present two explicit constructions of codes that can correct $t$ length-one reverse-complement duplications. The first construction achieves a redundancy of $2t\log_q n + O(\log_q\log_q n)$ with encoding complexity $O(n)$ and decoding complexity $O\big(n(\log_2 n)^4\big)$. The second construction achieves an improved redundancy of $(2t-1)\log_q n + O(\log_q\log_q n)$, but with encoding and decoding complexities of $O\big(n \cdot \mathrm{poly}(\log_2 n)\big)$.
💡 Research Summary
The paper addresses error‑correcting codes for two specific duplication errors that arise in in‑vivo DNA storage: reverse‑complement duplications (the copied substring is reversed and complemented) and palindromic duplications (the copy is only reversed). The authors first introduce the notion of an m‑RCD root, a string that contains no adjacent substrings of length m that are reverse‑complements of each other. They prove that if a codeword is an m‑RCD root, any reverse‑complement duplication of length k ≥ 3m − 3 can be uniquely identified and removed. By choosing m ≈ ⌈log_q n⌉ + 1 and adding a single redundancy symbol (e.g., a global checksum), they construct an explicit code that corrects an arbitrary number of such duplications, provided each duplication is at least 3⌈log_q n⌉ long and the duplications are disjoint.
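The two error types and the root condition described above can be illustrated with a minimal sketch. It assumes the DNA alphabet (q = 4) with the usual A/T and C/G complement pairing, and it assumes the copy is inserted immediately after the original substring. All function names are hypothetical and chosen for this example; they are not taken from the paper.

```python
# Illustrative sketch only: DNA alphabet with A<->T, C<->G complementation (an assumption).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(s):
    """Reverse the string and complement every symbol."""
    return "".join(COMPLEMENT[c] for c in reversed(s))

def rc_duplicate(x, i, k):
    """Insert the reverse complement of x[i:i+k] right after that substring."""
    block = x[i:i + k]
    return x[:i + k] + reverse_complement(block) + x[i + k:]

def pal_duplicate(x, i, k):
    """Insert the reversed (uncomplemented) copy of x[i:i+k] right after it."""
    block = x[i:i + k]
    return x[:i + k] + block[::-1] + x[i + k:]

def is_rcd_root(x, m):
    """True if no length-m substring of x is immediately followed by its reverse complement."""
    for i in range(len(x) - 2 * m + 1):
        if x[i + m:i + 2 * m] == reverse_complement(x[i:i + m]):
            return False
    return True

original = "ACCTA"
corrupted = rc_duplicate(original, 1, 2)   # duplicates "CC" as "GG"
print(corrupted)                           # ACCGGTA
print(is_rcd_root(original, 2), is_rcd_root(corrupted, 2))  # True False
```

In this toy case the original string is a 2-RCD root, and the duplication creates exactly the adjacent reverse-complement pair that the root condition forbids, which is what lets a decoder locate the inserted block.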
Next, the authors derive a Gilbert–Varshamov bound for codes that must correct a single reverse‑complement (or palindromic) duplication of any length. By counting the “duplication balls” around each word and ensuring they are disjoint, they show that there exist codes with redundancy at most
2 log_q n + log_q log_q n + O(1),
which therefore upper-bounds the optimal redundancy for this error model.
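To get a feel for the size of this expression, here is a worked instance with illustrative parameters. The values q = 4 and n = 4^10 are chosen for this example and are not taken from the paper; the O(1) term is left unevaluated because its constant is not given here.

```latex
\begin{align*}
q &= 4, \qquad n = 4^{10} = 1\,048\,576, \\
\log_q n &= 10, \qquad \log_q \log_q n = \log_4 10 \approx 1.66, \\
2\log_q n + \log_q \log_q n + O(1) &\approx 21.66 + O(1) \ \text{symbols}.
\end{align*}
```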
Finally, for alphabet sizes q ≥ 4, the paper presents two explicit families of codes that correct t length‑1 reverse‑complement duplications. Both constructions transform each length‑1 duplication into a substitution error by imposing run‑length constraints and then combine a substitution‑correcting code with an indel‑correcting code. The first family attains redundancy 2t log_q n + O(log_q log_q n), with encoding complexity O(n) and decoding complexity O(n (log_2 n)^4). The second family improves the redundancy to (2t − 1) log_q n + O(log_q log_q n) at the price of higher complexity, namely O(n·poly(log_2 n)) for both encoding and decoding. Both constructions reduce redundancy compared with generic indel‑correcting codes, which would require about 5 log_q n symbols for the same task.
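The gap between the two families can be made concrete by evaluating only the leading terms of the stated redundancies. The sketch below does that; the function name and the sample parameters are hypothetical, and the O(log_q log_q n) terms are ignored because their constants are not given here.

```python
import math

def leading_redundancy_terms(n, q, t):
    """Leading redundancy terms (in q-ary symbols) of the two constructions.

    Lower-order O(log_q log_q n) terms are deliberately ignored.
    """
    log_q_n = math.log(n, q)
    first = 2 * t * log_q_n          # 2t log_q n
    second = (2 * t - 1) * log_q_n   # (2t - 1) log_q n
    return first, second

# Sample parameters chosen for illustration only.
first, second = leading_redundancy_terms(n=4**10, q=4, t=2)
print(f"first construction:  ~{first:.1f} symbols")
print(f"second construction: ~{second:.1f} symbols")
```

Whatever the value of t, the leading terms differ by exactly log_q n, so the relative saving of the second construction is largest when t is small.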
Overall, the work provides (i) a low‑overhead, single‑symbol redundancy scheme for long duplications, (ii) a near‑optimal existential bound for arbitrary‑length duplications, and (iii) practical, efficiently decodable codes for multiple short duplications, thereby advancing the reliability of DNA‑based data storage systems.