Our goal is to describe an encoder-decoder for storing data in DNA where:
(1) The GC-AT content is at most 5% imbalanced.
(2) Homopolymer-runs are at most of length 3.
(3) The targets are encoded quaternary strands of length 100-500 nucleotides.
(4) The encoder-decoder should work at linear time (at most).
(5) Demands reasonable memory requirements.
Additionally, we assume that there will be a singular output length for every input length. In algorithms where the output length varies, we chose the worst-case output length.
Reliability is an inherent challenge for the emerging nonvolatile technology of racetrack memories, and there exists a fundamental relationship between codes designed for racetrack memories and codes with constrained periodicity. Previous works have sought to construct codes that avoid periodicity in windows yet have either only provided existence proofs or required high redundancy. This paper provides the first constructions for avoiding periodicity that are both efficient (average-linear time) and with low redundancy (near the lower bound). The proposed algorithms are based on iteratively repairing windows which contain periodicity until all the windows are valid. Intuitively, such algorithms should not converge as there is no monotonic progression; yet, we prove convergence with average-linear time complexity by exploiting subtle properties of the encoder. Overall, we both provide constructions that avoid periodicity in all windows, and we also study the cardinality of such constraints.
Multiple Sequence Alignment refers to the process of aligning a cluster of strands, usually protein or DNA, to achieve a maximal regions of similarity. More specifically, given a set S of m erroneous strands of DNA with different lengths, that assumingly originated from one common reference, the outcome of the MSA algorithm, is to be m strands with gap insertions into each strand, such that all conform to a length đżâ„max{đđ||đđ|=đ_đ}, and no index 0â€đâ€đż, yields a column consisting of only gaps, and the alignment maximizes the common substrings of the strands. MSA has been shown to be NP-complete problem. This work shows a new Sequence Alignment method, that leverage existing sequencing algorithms, in order to achieve a speedup of x80 over state of the art Multiple Sequencing Alignment algorithms. The method uses an existing pairwise alignment algorithm called FOGSAA [1], as the base of an Iterative Algorithm, along with two other phases. Further applications and Enhancements of the suggested method can further contribute to Modify the sequencing and alignment phases, in order to achieve a robust and efficient DNA data storage system.
DNA synthesis plays a pivotal role in modern molecular biology, enabling scientists to engineer and manipulate DNA for various applications, including the cutting-edge field of information storage in DNA. At the heart of molecular biology lies the ability to synthesize custom DNA sequences with precision. These sequences serve as the genetic code that defines the traits of living organisms. The remarkable data storage potential of DNA has garnered significant attention in recent years. DNA molecules can store vast amounts of digital information in their chemical structure, offering a durable and compact storage medium that holds promise for long-term data archiving. However, unlocking this potential requires a deep understanding of DNA synthesis processes. While the field of DNA synthesis has witnessed tremendous advancements, it is essential to acknowledge the cost constraints associated with experimental approaches. Traditional DNA synthesis techniques, including Maskless Array Synthesis (MAS) and others, often demand substantial financial investments. These costs encompass equipment, reagents, operational expenses, and on-going maintenance, making them a barrier for many research projects. To address the cost challenges and make research in information storage in DNA more accessible and sustainable, simulation emerges as an attractive alternative. Simulators provide a cost-effective means of studying DNA synthesis processes, allowing researchers to model and experiment with various scenarios without the need for expensive physical equipment and materials. In this project, we have created a simulator tailored specifically for Maskless Array Synthesis (MAS). This simulator is intended to facilitate the study of MAS processes in the context of DNA storage research while alleviating the financial constraints associated with physical experimentation. Grounded in the latest research insights and findings, including those detailed in the study titled âChemical and photochemical error rates in light-directed synthesis of complex DNA librariesâ, our MAS simulator represents a methodical effort to address the cost limitations. It provides a practical, cost-effective, and scalable solution for researchers engaged in the investigation of information storage in DNA through the MAS technique.
Composite DNA letters were introduced recently. The technology consists of synthesis and sequencing methods that exploit the built-in information redundancy of DNA data archiving. These methods involve using a logical enlarged alphabet, rather than the pure DNA bases only, and take advantage of the multiplication of DNA strands in current synthesis technologies to use fewer synthesis cycles per unit of data. In this project, we propose a new approach for DNA reconstruction over composite DNA alphabets. Based on another recent paper âHedgesâ [1], we implemented a Hashed based Error Correcting Code (HECC) for encoding and decoding binary messages to and from composite alphabets. Our HECC handles all three basic types of DNA errors: substitutions, insertions, and deletions, up to an error rate of 2% while synthesizing only 40 copies of the original strand. When handling substitutions only, our HECC succeeds in decoding up to an error rate of 15% while synthesizing only 20 copies of the original strand. We first present a short introduction to both the Composite and the Hedges papers. Secondly, we offer a detailed explanation of our encoding-decoding scheme and the specific methods we implemented. We then present our results and discuss some future possible optimizations to our algorithm.
[1] William H. Press, John A. Hawkins, Stephen K. Jones Jr, Ilya J. Finkelstein, HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints, PNAS, 2020.
Composite DNA letters were introduced recently. The technology consists of synthesis and sequencing methods that exploit the built-in information redundancy of DNA data archiving. These methods involve using a logical enlarged alphabet, rather than the pure DNA bases only, and take advantage of the multiplication of DNA strands in current synthesis technologies to use fewer synthesis cycles per unit of data. In this project, we propose a new approach for DNA reconstruction over composite DNA alphabets. Based on another recent paper âHedgesâ [1], we implemented a Hashed based Error Correcting Code (HECC) for encoding and decoding binary messages to and from composite alphabets. Our HECC handles all three basic types of DNA errors: substitutions, insertions, and deletions, up to an error rate of 2% while synthesizing only 40 copies of the original strand. When handling substitutions only, our HECC succeeds in decoding up to an error rate of 15% while synthesizing only 20 copies of the original strand. We first present a short introduction to both the Composite and the Hedges papers. Secondly, we offer a detailed explanation of our encoding-decoding scheme and the specific methods we implemented. We then present our results and discuss some future possible optimizations to our algorithm.
[1] William H. Press, John A. Hawkins, Stephen K. Jones Jr, Ilya J. Finkelstein, HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints, PNAS, 2020.
DNA-Storalator is a cross-platform software tool that simulates the complete process of encoding, storing, and decoding digital data in DNA molecules. The simulator receives an input file with the designed DNA strands that store digital data and emulates the different biological and algorithmical components of the storage system. The biological component includes simulation of the synthesis, PCR, and sequencing stages which are expensive and complicated and therefore are not widely accessible to the community. These processes amplify the data and generate noisy copies of each DNA strand, where the errors are insertions, deletions, long-deletions, and substitutions.