Error-correcting codes protect data against errors that occur during storage or transmission. In DNA storage, they ensure that the information written into DNA molecules can be retrieved and decoded accurately. These codes work by adding redundant information to the data being stored.
When the data is retrieved, the redundant information is used to detect errors and, where possible, correct them. Reed-Solomon codes, LDPC codes, and concatenated codes have previously been used for DNA storage; our goal is to design codes that specifically target edit errors, namely insertions, deletions, and substitutions. Error-correcting codes are therefore an essential part of making DNA storage systems reliable and accurate.
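To make the idea of redundancy against edit errors concrete, here is a minimal sketch of a classical single-deletion-correcting construction, the Varshamov-Tenengolts (VT) code. It is an illustration only and not one of the codes designed by the lab. It assumes a binary alphabet rather than the four DNA bases, and the function names and the brute-force decoder are hypothetical choices made for readability.

```python
from itertools import product


def vt_syndrome(bits):
    """Weighted checksum sum(i * x_i) mod (n + 1), with positions counted from 1."""
    n = len(bits)
    return sum(i * b for i, b in enumerate(bits, start=1)) % (n + 1)


def vt_codebook(n, a=0):
    """All binary words of length n whose syndrome equals a (assumed parameter)."""
    return [w for w in product((0, 1), repeat=n) if vt_syndrome(w) == a]


def correct_single_deletion(received, n, a=0):
    """Brute-force decoder: try every single-bit insertion and keep the codewords.

    `received` is a tuple of length n - 1, i.e. a codeword with one bit deleted.
    """
    candidates = set()
    for pos in range(n):
        for bit in (0, 1):
            word = received[:pos] + (bit,) + received[pos:]
            if vt_syndrome(word) == a:
                candidates.add(word)
    return candidates


# Usage: delete the third bit of a codeword and recover it.
codeword = vt_codebook(6)[3]
received = codeword[:2] + codeword[3:]
print(codeword, correct_single_deletion(received, 6))
```

The single checksum constraint is the only redundancy added, yet it is enough for the decoder to find exactly one codeword consistent with any word that has lost one bit. Practical decoders avoid the brute-force search, and codes for DNA must also handle insertions and substitutions over a four-letter alphabet.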
Clustering algorithms group together DNA strands that are similar to each other, based on their sequence content, index (barcode), or other characteristics. After synthesis, the DNA strands are all mixed together and are therefore unordered. Hence, once the sequenced reads are obtained, the first step of the retrieval process is to partition them into clusters (groups) according to the strand from which each read originated. Our lab designs clustering algorithms, and indexing solutions that support them, to make DNA clustering fast and accurate.
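As a rough illustration of index-based clustering, the sketch below assigns each read to the known index that is closest to the read's prefix. It assumes that the index sits at the start of every read, that all indices have the same length, and that the index is affected only by substitutions (so Hamming distance suffices). The function names and the `max_mismatches` threshold are hypothetical; this is not the lab's algorithm.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_by_index(reads, indices, index_len, max_mismatches=1):
    """Group reads by the nearest known index; reads too far from all indices are set aside."""
    clusters = defaultdict(list)
    unassigned = []
    for read in reads:
        prefix = read[:index_len]
        best = min(indices, key=lambda idx: hamming(prefix, idx))
        if len(prefix) == index_len and hamming(prefix, best) <= max_mismatches:
            clusters[best].append(read)
        else:
            unassigned.append(read)
    return dict(clusters), unassigned


# Usage with toy data (all sequences are made up for illustration).
reads = ["ACGTGGGG", "ACCTGGGA", "TTAACCCC", "GGGGGGGG"]
clusters, unassigned = cluster_by_index(reads, ["ACGT", "TTAA"], index_len=4)
print(clusters, unassigned)
```

An insertion or deletion inside the index shifts the prefix and defeats Hamming distance, which is one reason clustering methods for real reads rely on edit distance or on similarity over the whole sequence.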
Reconstruction algorithms recover data from several noisy copies of it. After clustering, a reconstruction algorithm receives each cluster of noisy DNA reads and aims to estimate the strand from which they originated. In this way it is possible to correct the errors in the reads and reconstruct the original synthesized strand. Our lab develops reconstruction algorithms based on methods such as dynamic programming, machine learning, and deep learning.
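The simplest form of reconstruction is a position-wise majority vote, sketched below. It assumes that the reads in a cluster differ from the original strand only by substitutions, so reads whose length differs from the most common length are simply discarded. This is a deliberately simplified stand-in and not one of the lab's algorithms; the function name is hypothetical.

```python
from collections import Counter


def majority_reconstruct(cluster):
    """Estimate the original strand by voting on each position separately."""
    # Keep only reads of the most common length (assumption: no indels in these).
    length = Counter(len(r) for r in cluster).most_common(1)[0][0]
    usable = [r for r in cluster if len(r) == length]
    # For each column of aligned bases, pick the most frequent symbol.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*usable)
    )


# Usage with a toy cluster (sequences are made up for illustration).
cluster = ["ACGTAC", "ACGTTC", "AGGTAC", "ACGTAC", "ACGTA"]
print(majority_reconstruct(cluster))
```

Insertions and deletions shift all later positions, so a plain column vote breaks down on real reads. Handling such misalignment is where techniques like dynamic programming come in.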
Constrained codes are applied in DNA storage systems to generate strands that satisfy constraints that have been shown to reduce error rates, and thus improve the reliability of the storage system. They are useful because they make it possible to avoid sequences that trigger typical error mechanisms. Common constraints in DNA storage include balanced GC content and a limit on the length of homopolymer runs. Our lab studies constrained codes and their limits under the requirements of different synthesis and sequencing methods.
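The two constraints named above are easy to state as a check on a candidate strand, as in the following sketch. The thresholds (GC content between 40% and 60%, runs of at most 3 identical bases) are assumed values chosen for illustration, and the function names are hypothetical; actual limits depend on the synthesis and sequencing technology.

```python
from itertools import groupby


def gc_content(strand):
    """Fraction of bases in the strand that are G or C."""
    return sum(base in "GC" for base in strand) / len(strand)


def max_homopolymer_run(strand):
    """Length of the longest run of identical consecutive bases."""
    return max(len(list(run)) for _, run in groupby(strand))


def satisfies_constraints(strand, gc_min=0.4, gc_max=0.6, max_run=3):
    """True if the strand meets both (assumed) constraints."""
    return (
        gc_min <= gc_content(strand) <= gc_max
        and max_homopolymer_run(strand) <= max_run
    )


# Usage with toy strands.
for s in ["ACGTACGT", "AAAACCGG", "AAAAACGT"]:
    print(s, gc_content(s), max_homopolymer_run(s), satisfies_constraints(s))
```

A checker like this only filters strands. A constrained code goes further and maps arbitrary data to strands that are guaranteed to pass, at some cost in rate.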
The channel capacity of DNA storage is the maximum rate at which information can be stored reliably, measured for example in bits per nucleotide. This Shannon capacity is determined by the number of distinct nucleotide bases that can be written, the length of the DNA strands, and the noise introduced while the DNA is written, stored, and read. It is a property of the storage channel itself: a coding scheme can approach the capacity but cannot exceed it.
DNA synthesis, PCR, and sequencing are all error-prone. The errors are dominated by substitutions of symbols, together with synchronization errors, namely deletions and insertions. Our lab studies bounds and other theoretical results on the capacity of the DNA storage channel.
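As a small worked example of a capacity calculation, the sketch below evaluates the textbook capacity of a four-symbol symmetric channel, in which each base is substituted with total probability p and the three wrong bases are equally likely. This is a strong simplifying assumption: it ignores deletions and insertions, whose capacity has no known closed form, and it is not the DNA storage channel model studied by the lab. The function name is hypothetical.

```python
import math


def quaternary_symmetric_capacity(p):
    """Capacity in bits per nucleotide of a 4-ary symmetric substitution channel.

    C = 2 + (1 - p) * log2(1 - p) + p * log2(p / 3), with 0 * log2(0) taken as 0.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")

    def plog(x):
        return x * math.log2(x) if x > 0 else 0.0

    # One correct symbol with probability 1 - p, three wrong ones with p / 3 each.
    return 2.0 + plog(1.0 - p) + 3.0 * plog(p / 3.0)


# Usage: capacity for a few substitution probabilities.
for p in (0.0, 0.01, 0.1, 0.75):
    print(p, quaternary_symmetric_capacity(p))
```

With no errors the capacity is the full 2 bits per nucleotide, and it drops to zero at p = 0.75, where the output is independent of the input.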
DNA sequencing is the process of determining the exact order of nucleotides in a DNA strand. In DNA storage, sequencing is used to read the stored data. Several sequencing techniques are available, including next-generation sequencing (NGS), Sanger sequencing, and single-molecule sequencing. Once the DNA has been sequenced, the stored data can be decoded and retrieved, which may involve reconstruction algorithms or error-correcting codes. Our lab studies algorithms for the assembly problem and for minimizing the sequencing depth.
DNA synthesis is the process of creating DNA strands with specific nucleotide sequences, which in DNA storage encode the data. Several synthesis techniques are available, including phosphoramidite chemistry (typically carried out on a solid support) and enzymatic synthesis. Each technique has its own advantages and limitations. Our lab studies the error behavior of the different synthesis methods in order to design codes and algorithms that decode the data successfully.
Our lab offers several software tools that analyze, simulate, and support the algorithmic and biological components of DNA storage systems. These tools include SOLQC, a quality-control tool for understanding the error behavior and characteristics of DNA synthesis and sequencing, and the DNA Storalator, which offers a complete end-to-end simulation of storing data in DNA together with the supporting algorithms and codes.
Optical genome mapping (OGM) is a method for obtaining information about the genome sequence of a single double-stranded DNA molecule by imaging it in an optical fluorescence microscope. Each DNA molecule is labeled at specific short sequence motifs, usually up to 6 bases in length. OGM has been shown to be useful for genome-wide mapping of features such as DNA damage, methylation, and structural variations, as well as for species identification, for example bacteria typing in clinical samples. Our lab studies the theoretical information limits of optical mapping, both for storage and in general.
In DNA storage, machine learning techniques can be used to analyze large datasets of DNA sequences and identify patterns or trends that might not be immediately apparent. Our lab uses machine learning algorithms for clustering, classification, and reconstruction.
Information theory is a mathematical framework for quantifying the transmission, processing, and storage of information. It provides tools to measure the amount of information in a message, determine the capacity of communication channels, and optimize data encoding. In DNA storage, information theory plays a crucial role in making the encoding of digital data into DNA sequences efficient, so that data can be stored compactly and retrieved accurately. By applying concepts such as entropy and error correction, we can optimize how data is written into DNA and address the challenges of storage density and retrieval accuracy. Our lab uses information theory to explore the theoretical limits of DNA-based data storage, with the aim of unlocking its full potential for long-term, high-density information preservation.
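As a small illustration of how entropy measures information, the sketch below computes the empirical entropy of the base distribution in a strand, in bits per nucleotide. It treats bases as independent draws, which is a simplifying assumption, and the function name is hypothetical.

```python
import math
from collections import Counter


def empirical_entropy(strand):
    """Shannon entropy (bits per symbol) of the symbol frequencies in the strand."""
    counts = Counter(strand)
    total = len(strand)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Usage with toy strands.
for s in ["ACGTACGTACGTACGT", "AACCAACC", "AAAAAAAA"]:
    print(s, empirical_entropy(s))
```

A strand that uses all four bases equally often reaches the maximum of 2 bits per nucleotide, while a strand made of a single repeated base carries none, which is why skewed base usage lowers the achievable storage density.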