Our goal is to describe an encoder-decoder for storing data in DNA where:
(1) The GC-AT content is at most 5% imbalanced.
(2) Homopolymer-runs are at most of length 3.
(3) The targets are encoded quaternary strands of length 100-500 nucleotides.
(4) The encoder-decoder runs in at most linear time.
(5) The encoder-decoder has reasonable memory requirements.
Additionally, we assume that there is a single output length for every input length. For algorithms whose output length varies, we use the worst-case output length.
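Constraints (1) to (3) above can be checked on a candidate strand with a short validator. The following is a minimal sketch, not the encoder-decoder itself. It assumes that "at most 5% imbalanced" means the GC fraction lies between 45% and 55%, which is one possible reading of the requirement. The function names and parameters are hypothetical.

```python
from itertools import groupby


def max_homopolymer_run(strand: str) -> int:
    """Length of the longest run of identical consecutive nucleotides."""
    return max(len(list(group)) for _, group in groupby(strand))


def gc_balanced(strand: str, max_imbalance_percent: int = 5) -> bool:
    """Check |GC fraction - 1/2| <= max_imbalance_percent / 100.

    ASSUMPTION: "5% imbalanced" is read as a GC fraction in [45%, 55%].
    Integer arithmetic avoids floating-point issues at the boundary.
    """
    n = len(strand)
    gc = sum(1 for base in strand if base in "GC")
    return abs(2 * gc - n) * 100 <= 2 * max_imbalance_percent * n


def satisfies_constraints(strand: str, max_run: int = 3,
                          min_len: int = 100, max_len: int = 500) -> bool:
    """Check length, alphabet, GC balance, and homopolymer-run constraints."""
    if not (min_len <= len(strand) <= max_len):
        return False
    if set(strand) - set("ACGT"):
        return False
    return gc_balanced(strand) and max_homopolymer_run(strand) <= max_run
```

Both checks make a single pass over the strand, so the validator itself runs in linear time and constant extra memory, in line with requirements (4) and (5).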
Reliability is an inherent challenge for the emerging nonvolatile technology of racetrack memories, and there exists a fundamental relationship between codes designed for racetrack memories and codes with constrained periodicity. Previous works have sought to construct codes that avoid periodicity in windows, yet they have either provided only existence proofs or required high redundancy. This paper provides the first constructions for avoiding periodicity that are both efficient (average-linear time) and low in redundancy (near the lower bound). The proposed algorithms are based on iteratively repairing windows that contain periodicity until all the windows are valid. Intuitively, such algorithms should not converge, as there is no monotonic progression; yet we prove convergence with average-linear time complexity by exploiting subtle properties of the encoder. Overall, we provide constructions that avoid periodicity in all windows, and we also study the cardinality of such constraints.
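To make the notion of a window "containing periodicity" concrete, the sketch below shows only the validity check that an iterative repair loop would rely on. It does not reproduce the paper's repair step or its encoder. It assumes the standard definition that a string w has period p if w[i] = w[i+p] for every valid i. The names `has_period`, `first_periodic_window`, `window`, and `max_period` are hypothetical.

```python
from typing import Optional


def has_period(w: str, p: int) -> bool:
    """True if w[i] == w[i + p] for every i with i + p < len(w)."""
    return all(w[i] == w[i + p] for i in range(len(w) - p))


def first_periodic_window(x: str, window: int, max_period: int) -> Optional[int]:
    """Return the start index of the first length-`window` substring of x
    that has some period p with 1 <= p <= max_period, or None if every
    window is valid. This naive scan is for illustration only and is not
    the average-linear-time procedure described in the text.
    """
    for start in range(len(x) - window + 1):
        w = x[start:start + window]
        if any(has_period(w, p) for p in range(1, max_period + 1)):
            return start
    return None
```

A repair-based encoder would call a check of this kind repeatedly, modifying the offending window each time, until the function returns `None`.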
The goal of this project is to create a database of previously published data sets from DNA storage experiments, while providing statistics on each data set. The student will create a user-friendly interface for the data sets that provides the suggested statistical analysis together with previous results.
Multiple Sequence Alignment for DNA Storage Clusters
Multiple Sequence Alignment (MSA) refers to the process of aligning a cluster of strands, usually protein or DNA, to achieve maximal regions of similarity. More specifically, given a set $S$ of $m$ erroneous DNA strands of different lengths that presumably originated from one common reference, the output of an MSA algorithm is $m$ strands with gaps inserted into each strand, such that all strands have a common length $L \ge \max_i n_i$, where $n_i = |S_i|$, no column index $0 \le j < L$ consists only of gaps, and the alignment maximizes the common substrings of the strands. MSA has been shown to be an NP-complete problem. This work presents a new sequence alignment method that leverages existing sequence alignment algorithms to achieve an 80x speedup over state-of-the-art Multiple Sequence Alignment algorithms. The method uses an existing pairwise alignment algorithm called FOGSAA [1] as the base of an iterative algorithm, along with two other phases. Further applications and enhancements of the suggested method could help modify the sequencing and alignment phases in order to achieve a robust and efficient DNA data storage system.
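The structural conditions in the MSA definition above (equal row length, gaps only inserted, no all-gap column) can be expressed as a small validity check. This is a minimal sketch of the output conditions only. It does not implement FOGSAA or the proposed iterative method, and it does not score how well the alignment maximizes common substrings. The gap symbol `-` and the function name are assumptions.

```python
from typing import Sequence

GAP = "-"  # ASSUMPTION: gaps are written as '-'


def is_valid_alignment(strands: Sequence[str], aligned: Sequence[str]) -> bool:
    """Check the structural conditions of a multiple sequence alignment.

    1. There is one aligned row per input strand.
    2. All aligned rows share a common length L.
    3. Removing gaps from row i recovers strand i, so L >= max_i |S_i|.
    4. No column consists only of gaps.
    """
    if not aligned or len(strands) != len(aligned):
        return False
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        return False
    if any(row.replace(GAP, "") != s for s, row in zip(strands, aligned)):
        return False
    return not any(all(row[j] == GAP for row in aligned) for j in range(length))
```

For example, the rows `ACGT`, `A-GT`, `AC-T` form a valid alignment of the strands `ACGT`, `AGT`, `ACT`, whereas appending a gap to every row would create an all-gap column and fail the check.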