Composite DNA letters were introduced recently. The technology consists of synthesis and sequencing methods that exploit the built-in information redundancy of DNA data archiving. These methods use a logically enlarged alphabet rather than only the four pure DNA bases, and take advantage of the many copies of each strand produced by current synthesis technologies to use fewer synthesis cycles per unit of data. In this project, we propose a new approach to DNA reconstruction over composite DNA alphabets. Building on the recent HEDGES paper [1], we implemented a Hash-based Error-Correcting Code (HECC) for encoding binary messages into composite alphabets and decoding them back. Our HECC handles all three basic types of DNA errors (substitutions, insertions, and deletions) up to an error rate of 2% while synthesizing only 40 copies of the original strand. When handling substitutions only, our HECC decodes successfully up to an error rate of 15% while synthesizing only 20 copies of the original strand. First, we give a short introduction to both the Composite and the HEDGES papers. Second, we explain our encoding-decoding scheme in detail, along with the specific methods we implemented. Finally, we present our results and discuss possible future optimizations to our algorithm.
[1] William H. Press, John A. Hawkins, Stephen K. Jones Jr., and Ilya J. Finkelstein. HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. PNAS, 2020.