Robust Composite DNA Storage under Sampling Randomness, Substitution, and Insertion-Deletion Errors


DNA data storage offers a high-density, long-term alternative to traditional storage systems, addressing the exponential growth of digital data. Composite DNA extends this paradigm by leveraging mixtures of nucleotides to increase storage capacity beyond the four standard bases. In this work, we model composite DNA storage as a multinomial channel and draw an analogy to digital modulation by representing composite letters on the three-dimensional probability simplex. To mitigate errors caused by sampling randomness, we derive transition probabilities and log-likelihood ratios (LLRs) for each constellation point and employ practical channel codes for error correction. We then extend this framework to substitution and insertion-deletion (ID) channels, proposing constellation update rules that account for these additional impairments. Numerical results demonstrate that our approach achieves reliable performance with existing LDPC codes, whereas prior schemes designed for limited-magnitude probability errors degrade significantly under sampling randomness.


💡 Research Summary

This paper addresses the emerging paradigm of composite DNA storage, where each logical symbol is not a single nucleotide (A, C, G, T) but a prescribed mixture of nucleotides across many physical copies of a strand. The authors model this situation as a multinomial channel: a transmitted composite symbol is represented by a probability vector ρₛ = (ρₛ,A, ρₛ,C, ρₛ,T, ρₛ,G) that lies on the three‑dimensional probability simplex Δₗ. By interpreting each ρₛ as a constellation point, they draw a direct analogy to digital modulation schemes, enabling the use of well‑established communication‑theoretic tools.
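As an illustration of the constellation view, composite letters can be enumerated as probability vectors on the simplex. The sketch below assumes a hypothetical grid constellation whose nucleotide fractions are multiples of 1/k; the paper's actual constellation design may differ.

```python
from itertools import product

def simplex_constellation(k):
    """Enumerate composite letters whose nucleotide fractions
    (A, C, T, G) are multiples of 1/k -- a hypothetical grid on the
    three-dimensional probability simplex, for illustration only."""
    points = []
    for counts in product(range(k + 1), repeat=4):
        if sum(counts) == k:
            points.append(tuple(c / k for c in counts))
    return points

# k = 2 yields C(5, 3) = 10 constellation points,
# e.g. the 50/50 A-C mixture (0.5, 0.5, 0.0, 0.0).
points = simplex_constellation(2)
```

Denser grids (larger k) give more constellation points, and hence more bits per composite letter, at the cost of points lying closer together on the simplex.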

The system works as follows. An input binary message m is encoded with a forward error‑correcting code (the paper focuses on low‑density parity‑check (LDPC) codes) to produce a codeword c. The codeword is partitioned into L‑bit blocks; each block selects one of the 2ᴸ possible constellation points. During DNA synthesis and sequencing, n physical copies of each strand are read. The observed nucleotide counts (k_A, k_C, k_T, k_G) follow a Multinomial(n, ρₛ) distribution, which the authors denote as the multinomial channel. The probability of observing a particular count vector dᵢ given a transmitted ρₛ is given explicitly in equation (4).
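The multinomial channel can be sketched directly from this description: the likelihood of a count vector d given ρₛ is the multinomial pmf, and a noisy read-out is a draw from it. The function names below are illustrative, not from the paper.

```python
import math
import random

def multinomial_pmf(counts, n, rho):
    """P(d | rho): probability of observing nucleotide counts
    (k_A, k_C, k_T, k_G) among n reads of one strand position --
    the multinomial channel (cf. equation (4))."""
    assert sum(counts) == n
    coef = math.factorial(n)
    for k in counts:
        coef //= math.factorial(k)
    p = 1.0
    for k, r in zip(counts, rho):
        p *= r ** k          # 0**0 == 1, so zero components are safe
    return coef * p

def sample_reads(n, rho, rng=random):
    """Draw n sequencing reads of one position from mixture rho."""
    counts = [0, 0, 0, 0]
    for _ in range(n):
        u, acc = rng.random(), 0.0
        for i, r in enumerate(rho):
            acc += r
            if u < acc:
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return counts
```

As a sanity check, a pure letter such as ρ = (1, 0, 0, 0) yields the count vector (n, 0, 0, 0) with probability 1.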

From this transition probability, the posterior probability P(ρₛ | dᵢ) is proportional to P(dᵢ | ρₛ) when a uniform prior over constellation points is assumed. The log‑likelihood ratios (LLRs) for each bit of the L‑bit block are then computed as the ratio of summed posteriors over all constellation points whose corresponding bit is 0 versus 1 (equation 9). The resulting LLR vector for the whole codeword is fed directly into a standard binary LDPC decoder (e.g., sum‑product or min‑sum algorithm). This approach leverages existing LDPC implementations without requiring a custom decoder for the composite DNA channel.
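The per-bit LLR computation described above can be sketched as follows. The bit labeling of constellation indices is an assumption for illustration; the paper's equation (9) fixes the actual mapping.

```python
import math

def multinomial_pmf(counts, n, rho):
    """Multinomial channel likelihood P(d | rho)."""
    coef = math.factorial(n)
    for k in counts:
        coef //= math.factorial(k)
    p = 1.0
    for k, r in zip(counts, rho):
        p *= r ** k
    return coef * p

def block_llrs(counts, n, constellation, L):
    """Per-bit LLRs for one L-bit block, assuming a uniform prior
    over the 2**L constellation points:
        LLR_j = log( sum_{s: bit_j(s)=0} P(d | rho_s)
                   / sum_{s: bit_j(s)=1} P(d | rho_s) ).
    The binary expansion of the index s is an assumed bit labeling."""
    likes = [multinomial_pmf(counts, n, rho) for rho in constellation]
    llrs = []
    for j in range(L):
        num = sum(p for s, p in enumerate(likes) if not (s >> j) & 1)
        den = sum(p for s, p in enumerate(likes) if (s >> j) & 1)
        llrs.append(math.log(max(num, 1e-300) / max(den, 1e-300)))
    return llrs
```

The resulting LLRs can be concatenated across blocks and passed unchanged to any off-the-shelf binary LDPC decoder, which is exactly the plug-in property the authors exploit.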

The authors extend the basic model to incorporate two common DNA storage impairments: substitution errors and insertion‑deletion (ID) errors. For substitutions, each nucleotide is independently replaced by a random alternative with probability ε. They show that the effective probability vector after the substitution channel becomes ˆρₛ = (1 − ε) ρₛ + (ε/3)(1 − ρₛ) (equations 12‑13). By substituting ˆρₛ for ρₛ in the multinomial likelihood, the LLR computation automatically accounts for the extra noise, and it also resolves potential 0/0 indeterminate forms that arise when a component of ρₛ is zero but a substitution introduces that nucleotide in the observation.
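The substitution update rule is a one-line transformation of each constellation point:

```python
def substitution_update(rho, eps):
    """Effective mixture after the substitution channel
    (cf. equations (12)-(13)): each read base is kept with
    probability 1 - eps and replaced by one of the three other
    bases uniformly with probability eps, so componentwise
        rho_hat_x = (1 - eps) * rho_x + (eps / 3) * (1 - rho_x).
    Components that were exactly 0 become eps/3 > 0, which removes
    the 0/0 indeterminacy in the likelihood."""
    return tuple((1 - eps) * r + (eps / 3) * (1 - r) for r in rho)
```

Note that the update preserves the simplex: summing the components gives (1 − ε) + (ε/3)(4 − 1) = 1, so ρ̂ₛ is again a valid constellation point.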

Insertion‑deletion errors are modeled as independent per‑position events: with probability p_i a random nucleotide is inserted after the current base, and with probability p_d the current base is deleted. Exact likelihood computation would require enumerating all possible insertion/deletion patterns, which is computationally prohibitive. The authors therefore adopt a pragmatic sub‑optimal strategy: they condition on receiving strands of the original length (i.e., the number of insertions equals the number of deletions) and discard all other outcomes. Under this condition, they compute the probability p_ns,i that a given position experiences “no shift” (equation 14) and use it to adjust the constellation points before the LLR calculation. Although this approximation sacrifices optimality, simulation results show that it still yields substantial performance gains over naïve decoding.
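The flavor of the “no shift” quantity can be illustrated with a small dynamic program: each of the i − 1 preceding positions independently shifts position i by +1 (insertion) or −1 (deletion), and p_ns,i is the probability that the net shift is zero. This sketch does not apply the length-preserving conditioning of the paper's equation (14), so it is illustrative only.

```python
def no_shift_prob(i, p_ins, p_del):
    """Illustrative 'no shift' probability for position i: each of
    the i - 1 preceding positions inserts a base (+1 shift) with
    probability p_ins or deletes one (-1) with probability p_del.
    Returns P(net shift at position i == 0).  Unlike equation (14),
    this does NOT condition on the strand keeping its length."""
    dist = {0: 1.0}                      # distribution of the net shift
    p_keep = 1.0 - p_ins - p_del
    for _ in range(i - 1):
        nxt = {}
        for shift, p in dist.items():
            for d, q in ((1, p_ins), (-1, p_del), (0, p_keep)):
                nxt[shift + d] = nxt.get(shift + d, 0.0) + p * q
        dist = nxt
    return dist.get(0, 0.0)
```

As expected, the no-shift probability is 1 at the first position and decays as i grows, which is why positions deep into a strand are the ones most degraded by ID errors.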

Performance evaluation is carried out with several LDPC codes of different block lengths and rates. In the pure sampling‑randomness scenario (no substitutions or ID errors), the proposed multinomial‑LLR method achieves error‑rate curves close to the theoretical capacity of the multinomial channel and markedly outperforms prior schemes that treat composite DNA errors as limited‑magnitude probability perturbations.

