Reconstruction of sequences refers to a large class of problems in which there are several noisy copies of the information, and the goal is to decode the information, either with small or zero error probability. The problem of using a set of erroneous sequences in order to recover the correct one falls under the framework of Levenshtein’s reconstruction problem and the trace reconstruction problem. One of the dominant motivating applications of the sequence reconstruction problems is DNA storage, where every DNA strand has several noisy copies.
In the reconstruction problem, the sequence is a codeword and the goal is to determine the minimum number of noisy copies that guarantees unique decoding in the worst case. This number must be larger than the largest intersection of two balls of any two codewords in the code, where the ball is the set of all noisy copies that can be received. Even though the minimum number of copies which guarantees unique decoding has been solved in many cases, in practice, and in particular for the DNA storage channel, we do not necessarily have control on the size of each cluster, and it is very likely that this size is significantly smaller than the required minimum size. In this case, the goal is to minimize the distance between the original sequence and the decoder’s estimation or to output a small list of words which contain the correct sequence. This talk will review the recent advances in several problems that fall under the general framework of the sequence reconstruction problem and their applicability for DNA storage.
DNA-based storage has attracted significant attention due to recent demonstrations of the viability of storing information in macromolecules. Given the trends in cost decreases of DNA synthesis and sequencing, it is estimated that within the next decade DNA storage may become a highly competitive archiving technology. This technology introduces new challenges in finding coding solutions to address various problems associated with the implementation of DNA-based storage systems. In this talk, I will present the current progress of DNA-based storage system. First, a survey will be given on the major breakthroughs in this area from the past decade. This will lead to a current overview of this technology both in the academic and industrial communities. Then, we will present a forecast to the progress of this technology and its applications. Lastly, we will present the DNA-Storalator which is a cross-platform software tool that simulates the complete process of encoding, storing, and decoding digital data in DNA molecules. This end-to-end DNA storage simulator grants researchers from all fields an accessible complete simulator to examine new biological technologies, coding techniques, and algorithms for current and future DNA storage systems.
DNA-based storage has attracted significant attention due to recent demonstrations of the viability of storing information in macromolecules. Given the trends in cost decreases of DNA synthesis and sequencing, it is estimated that within the next decade DNA storage may become a highly competitive archiving technology. This technology introduces new challenges in finding coding solutions to address various problems associated with the implementation of DNA-based storage systems. In this talk, we will present the DNA-Storalator which is a cross-platform software tool that simulates the complete process of encoding, storing, and decoding digital data in DNA molecules. The biological component includes simulation of the synthesis, PCR, and sequencing stages which are expensive and complicated and therefore are not widely accessible to the community. These processes amplify the data and generate noisy copies of each DNA strand, where the errors are insertions, deletions, long-deletions, and substitutions. The error rates are based on a comprehensive analysis of data from previous experiments but can also be customized. Additionally, the tool can analyze new datasets and characterize their error rates to build new error models for future usage in the simulator. The coding components are: 1. Clustering algorithms which partition all output noisy strands into groups according to the designed strand they originated from; 2. State-of-the-art reconstruction algorithms that are invoked on each cluster to output a close/exact estimate of the designed strand. These algorithms will be explained in more details during the talk; and 3. Integration with external error-correcting codes and other encoding and decoding techniques. This end-to-end DNA storage simulator grants researchers from all fields an accessible complete simulator to examine new biological technologies, coding techniques, and algorithms for current and future DNA storage systems.
Reconstruction of sequences refers to a large class of problems in which there are several noisy copies of the information, and the goal is to decode the information, either with small or zero error probability. The problem of using a set of erroneous sequences in order to recover the correct one falls under the framework of Levenshtein’s reconstruction problem and the trace reconstruction problem. One of the dominant motivating applications of the sequence reconstruction problems is DNA storage, where every DNA strand has several noisy copies.
In the reconstruction problem, the sequence is a codeword and the goal is to determine the minimum number of noisy copies that guarantees unique decoding in the worst case. This number must be larger than the largest intersection of two balls of any two codewords in the code, where the ball is the set of all noisy copies that can be received. Even though the minimum number of copies which guarantees unique decoding has been solved in many cases, in practice, and in particular for the DNA storage channel, we do not necessarily have control on the size of each cluster, and it is very likely that this size is significantly smaller than the required minimum size. In this case, the goal is to minimize the distance between the original sequence and the decoder’s estimation or to output a small list of words which contain the correct sequence. This talk will review the recent advances in several problems that fall under the general framework of the sequence reconstruction problem and their applicability for DNA storage.
In this model and its features, and then turn to describe two special cases of it – the convolutional sparse coding (CSC) and its multi-layered version (ML-CSC). Amazingly, as we will carefully show, ML-CSC provides a solid theoretical foundation to … deep-learning. Alongside this main message of bringing a theoretical backbone to deep-learning, another central message that will accompany us throughout the talk: Generative models for describing data sources enable a systematic way to design algorithms, while also providing a complete mechanism for a theoretical analysis of these algorithms’ performance. This talk is meant for newcomers to this field – no prior knowledge on sparse approximation is assumed.