DNA Storage Lab | Publications

Reconstruction Algorithms

Journal Papers

Dvir Ben-Shabat, Adar Hadad, Avital Boruchovsky, Prof. Eitan Yaakobi, "GradHC: highly reliable gradual hash-based clustering for DNA storage systems", Bioinformatics, Volume 40, Issue 5, May 2024.

Daniella Bar-Lev, Itai Or, Omer Sabary, Tuvi Etzion, Prof. Eitan Yaakobi, "Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning", Arxiv, preprint

abstractBibTeX

BibTeX

@article{bar2021deep,
title={Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory
and Deep Learning},
author={Bar-Lev, Daniella and Orr, Itai and Sabary, Omer and Etzion, Tuvi
and Yaakobi, Eitan},
journal={arXiv preprint arXiv:2109.00031},
year={2021}
}

copy to clipboard

The concept of DNA storage was first suggested in 1959 by Richard Feynman who shared his vision regarding nanotechnology in the talk “There is plenty of room at the bottom”. Later, towards the end of the 20-th century, the interest in storage solutions based on DNA molecules was increased as a result of the human genome project which in turn led to a significant progress in sequencing and assembly methods. DNA storage enjoys major advantages over the well-established magnetic and optical storage solutions. As opposed to magnetic solutions, DNA storage does not require electrical supply to maintain data integrity and is superior to other storage solutions in both density and durability. Given the trends in cost decreases of DNA synthesis and sequencing, it is now acknowledged that within the next 10-15 years DNA storage may become a highly competitive archiving technology and probably later the main such technology. With that said, the current implementations of DNA based storage systems are very limited and are not fully optimized to address the unique pattern of errors which characterize the synthesis and sequencing processes. In this work, we propose a robust, efficient and scalable solution to implement DNA-based storage systems. Our method deploys Deep Neural Networks (DNN) which reconstruct a sequence of letters based on imperfect cluster of copies generated by the synthesis and sequencing processes. A tailor-made Error-Correcting Code (ECC) is utilized to combat patterns of errors which occur during this process. Since our reconstruction method is adapted to imperfect clusters, our method overcomes the time bottleneck of the noisy DNA copies clustering process by allowing the use of a rapid and scalable pseudo-clustering instead. Our architecture combines between convolutions and transformers blocks and is trained using synthetic data modelled after real data statistics.

Omer Sabary, Alexander Yucovich, Guy Shapira, Prof. Eitan Yaakobi, "Reconstruction Algorithms for DNA - Storage Systems", bioArxiv, preprint

abstractBibTeX

BibTeX

@article {Sabary2020.09.16.300186,
author = {Sabary, Omer and Yucovich, Alexander and Shapira, Guy and
Yaakobi, Eitan},
title = {Reconstruction Algorithms for DNA
-
Storage Systems},
elocation-id ={2020.09.16.300186},
year = {2020},
doi = {10.1101/2020.09.16.300186},
publisher = {Cold Spring Harbor Laboratory}

copy to clipboard

In the trace reconstruction problem a length-n string x yields a collection of noisy copies, called traces, y1, …, yt where each y i is independently obtained from x by passing through a deletion channel, which deletes every symbol with some fixed probability. The main goal under this paradigm is to determine the required minimum number of i.i.d traces in order to reconstruct x with high probability. The trace reconstruction problem can be extended to the model where each trace is a result of x passing through a deletion-insertion-substitution
channel, which introduces also insertions and substitutions. Motivated by the storage channel of DNA, this work is focused on another variation of the trace reconstruction problem, which is referred by the DNA reconstruction problem. A DNA reconstruction algorithm is a mapping which receives ttraces y1, …, yt as an input and produces, an estimation of x. The goal in the DNA reconstruction problem is to minimize the edit distance between the original string and the algorithm’s estimation. For the deletion channel case, the problem is referred by the deletion DNA reconstruction problem and the goal is to minimize the Levenshtein distance. In this work, we present several new algorithms for these reconstruction problems. Our algorithms look globally on the entire sequence of the traces and use dynamic programming algorithms, which are used for the shortest common super sequence and the longest common subsequence problems, in order to decode the original sequence. Our algorithms do not require any limitations on the input and the number of traces, and more than that, they perform well even for error probabilities as high as 0.27. The algorithms have been tested on simulated data as well as on data from previous DNA experiments and are shown to outperform all previous algorithms.

Roy Shafir, Omer Sabary, Leon Anavy, Prof. Eitan Yaakobi, Zohar Yakhini, "Sequence design and reconstruction under the repeat channel in enzymatic DNA synthesis‏", IEEE Transactions on Communications

abstractBibTeX

BibTeX

@ARTICLE{10313138,
author={Shafir, Roy and Sabary, Omer and Anavy, Leon and Yaakobi, Eitan and Yakhini, Zohar},
journal={IEEE Transactions on Communications},
title={Sequence Design and Reconstruction Under the Repeat Channel in Enzymatic DNA Synthesis},
year={2024},
volume={72},
number={2},
pages={675-691},
keywords={DNA;Noise measurement;Europe;Encoding;Reconstruction algorithms;Random variables;Memory;DNA-based informatics;DNA-based data storage;deletion channel;insertion channel;sequence reconstruction;enzymatic DNA synthesis;nucleic acid sequencing;alternating-length limited codes},
doi={10.1109/TCOMM.2023.3330848}}

copy to clipboard

Using synthetic DNA for data storage and for physical information encoding in labeling, tracing, and authentication applications is becoming more feasible as synthesis and reading technologies are improving. DNA in data storage applications has several advantages such as very high physical density and robustness. Some of the new synthesis technologies lead to repetition noise, consisting of sticky insertions and deletions in the resulting messages. In this paper, we address reconstruction algorithms for multiple trace communication channels with repetition (sticky insertion and deletion) noise. We prove correctness and analyze failure rates, both analytically and on simulated data. We identify a failure mechanism related to alternating stretches in the design sequence that leads to a potential bias in the data derived from reads (traces) and used for reconstruction. To minimize this effect we introduce alternating …

Conferences

Shubhransh Singhvi, Avital Boruchovsky, Han-Mao Kiah, Prof. Eitan Yaakobi, "Data-Driven Bee Identification for DNA Strands", International Symposium on Information Theory (ISIT2023)

Daniella Bar-Lev, Sagi Markovich, Prof. Eitan Yaakobi, Yonatan Yehezkeally, "Adversarial Torn - paper Codes", IEEE International Symposium on Information Theory (ISIT), 2022, pp. 2934 – 2939

abstractBibTeX

BibTeX

@INPROCEEDINGS{9834766,
author={Bar-Lev, Daniella and Marcovich, Sagi and Yaakobi, Eitan and Yehezkeally, Yonatan},
booktitle={2022 IEEE International Symposium on Information Theory (ISIT)},
title={Adversarial Torn
-
paper Codes},
year={2022},
volume={},
number={},
pages={2934-2939},
doi={10.1109/ISIT50566.2022.9834766}}

copy to clipboard

This paper studies the adversarial torn-paper channel. This problem is motivated by applications in DNA data storage where the DNA strands that carry the information may break into smaller pieces that are received out of order. Our model extends the previously researched probabilistic setting to the worst-case. We deve lop code constructions for any parameters of the channel for which non-vanishing asymptotic rate is possible and show that our constructions achieve optimal asymptotic rate while allowing for efficient encoding and decoding. Finally, we extend our results to related settings included multi-strand storage, presence of substitution errors, or incomplete coverage.

Clustering Algorithms

Journal Papers

Avital Boruchovsky, Daniella Bar-Lev, Prof. Eitan Yaakobi, "DNA-Correcting Codes: End-to-end Correction in DNA Storage Systems", IEEE Transactions on Information Theory.

Dvir Ben-Shabat, Adar Hadad, Avital Boruchovsky, Prof. Eitan Yaakobi, "GradHC: highly reliable gradual hash-based clustering for DNA storage systems", Bioinformatics, Volume 40, Issue 5, May 2024.

Daniella Bar-Lev, Itai Or, Omer Sabary, Tuvi Etzion, Prof. Eitan Yaakobi, "Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning", Arxiv, preprint

abstractBibTeX

BibTeX

@article{bar2021deep,
title={Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory
and Deep Learning},
author={Bar-Lev, Daniella and Orr, Itai and Sabary, Omer and Etzion, Tuvi
and Yaakobi, Eitan},
journal={arXiv preprint arXiv:2109.00031},
year={2021}
}

copy to clipboard

The concept of DNA storage was first suggested in 1959 by Richard Feynman who shared his vision regarding nanotechnology in the talk “There is plenty of room at the bottom”. Later, towards the end of the 20-th century, the interest in storage solutions based on DNA molecules was increased as a result of the human genome project which in turn led to a significant progress in sequencing and assembly methods. DNA storage enjoys major advantages over the well-established magnetic and optical storage solutions. As opposed to magnetic solutions, DNA storage does not require electrical supply to maintain data integrity and is superior to other storage solutions in both density and durability. Given the trends in cost decreases of DNA synthesis and sequencing, it is now acknowledged that within the next 10-15 years DNA storage may become a highly competitive archiving technology and probably later the main such technology. With that said, the current implementations of DNA based storage systems are very limited and are not fully optimized to address the unique pattern of errors which characterize the synthesis and sequencing processes. In this work, we propose a robust, efficient and scalable solution to implement DNA-based storage systems. Our method deploys Deep Neural Networks (DNN) which reconstruct a sequence of letters based on imperfect cluster of copies generated by the synthesis and sequencing processes. A tailor-made Error-Correcting Code (ECC) is utilized to combat patterns of errors which occur during this process. Since our reconstruction method is adapted to imperfect clusters, our method overcomes the time bottleneck of the noisy DNA copies clustering process by allowing the use of a rapid and scalable pseudo-clustering instead. Our architecture combines between convolutions and transformers blocks and is trained using synthetic data modelled after real data statistics.

Conferences

Shubhransh Singhvi, Avital Boruchovsky, Han-Mao Kiah, Prof. Eitan Yaakobi, "Data-Driven Bee Identification for DNA Strands", International Symposium on Information Theory (ISIT2023)

Avital Boruchovsky, Daniella Bar-Lev, Prof. Eitan Yaakobi, "DNA-Correcting Codes: End-to-end Correction in DNA Storage Systems", International Symposium on Information Theory (ISIT2023)

Daniella Bar-Lev, Sagi Markovich, Prof. Eitan Yaakobi, Yonatan Yehezkeally, "Adversarial Torn - paper Codes", IEEE International Symposium on Information Theory (ISIT), 2022, pp. 2934 – 2939

abstractBibTeX

BibTeX

@INPROCEEDINGS{9834766,
author={Bar-Lev, Daniella and Marcovich, Sagi and Yaakobi, Eitan and Yehezkeally, Yonatan},
booktitle={2022 IEEE International Symposium on Information Theory (ISIT)},
title={Adversarial Torn
-
paper Codes},
year={2022},
volume={},
number={},
pages={2934-2939},
doi={10.1109/ISIT50566.2022.9834766}}

copy to clipboard

This paper studies the adversarial torn-paper channel. This problem is motivated by applications in DNA data storage where the DNA strands that carry the information may break into smaller pieces that are received out of order. Our model extends the previously researched probabilistic setting to the worst-case. We deve lop code constructions for any parameters of the channel for which non-vanishing asymptotic rate is possible and show that our constructions achieve optimal asymptotic rate while allowing for efficient encoding and decoding. Finally, we extend our results to related settings included multi-strand storage, presence of substitution errors, or incomplete coverage.

Error-Correcting Codes

Journal Papers

Avital Boruchovsky, Daniella Bar-Lev, Prof. Eitan Yaakobi, "DNA-Correcting Codes: End-to-end Correction in DNA Storage Systems", IEEE Transactions on Information Theory.

Boaz Moav, Ryan Gabrys, Prof. Eitan Yaakobi, "Tail-Erasure-Correcting Codes‏", IEEE Transactions on Information Theory‏.

Daniella Bar-Lev, Itai Or, Omer Sabary, Tuvi Etzion, Prof. Eitan Yaakobi, "Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning", Arxiv, preprint

abstractBibTeX

BibTeX

@article{bar2021deep,
title={Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory
and Deep Learning},
author={Bar-Lev, Daniella and Orr, Itai and Sabary, Omer and Etzion, Tuvi
and Yaakobi, Eitan},
journal={arXiv preprint arXiv:2109.00031},
year={2021}
}

copy to clipboard

The concept of DNA storage was first suggested in 1959 by Richard Feynman who shared his vision regarding nanotechnology in the talk “There is plenty of room at the bottom”. Later, towards the end of the 20-th century, the interest in storage solutions based on DNA molecules was increased as a result of the human genome project which in turn led to a significant progress in sequencing and assembly methods. DNA storage enjoys major advantages over the well-established magnetic and optical storage solutions. As opposed to magnetic solutions, DNA storage does not require electrical supply to maintain data integrity and is superior to other storage solutions in both density and durability. Given the trends in cost decreases of DNA synthesis and sequencing, it is now acknowledged that within the next 10-15 years DNA storage may become a highly competitive archiving technology and probably later the main such technology. With that said, the current implementations of DNA based storage systems are very limited and are not fully optimized to address the unique pattern of errors which characterize the synthesis and sequencing processes. In this work, we propose a robust, efficient and scalable solution to implement DNA-based storage systems. Our method deploys Deep Neural Networks (DNN) which reconstruct a sequence of letters based on imperfect cluster of copies generated by the synthesis and sequencing processes. A tailor-made Error-Correcting Code (ECC) is utilized to combat patterns of errors which occur during this process. Since our reconstruction method is adapted to imperfect clusters, our method overcomes the time bottleneck of the noisy DNA copies clustering process by allowing the use of a rapid and scalable pseudo-clustering instead. Our architecture combines between convolutions and transformers blocks and is trained using synthetic data modelled after real data statistics.

Omer Sabary, Yoav Orlev, Roi Shafir, Leon Anavy, Prof. Eitan Yaakobi, "SOLQC: Synthetic Oligo Library Quality Control Tool", Bioinformatics, Volume 37, Issue 5, 1 March 2021, Pages 720 – 722

abstractBibTeX

BibTeX

@article{10.1093/bioinformatics/btaa740,
author = {Sabary, Omer and Orlev, Yoav and Shafir, Roy and Anavy, Leon
and Yaakobi, Eitan and Yakhini, Zohar},
title = "{SOLQC: Synthetic Oligo Library Quality Control tool}",
journal = {Bioinformatics},
volume = {37},
number = {5},
pages = {720-722},
year = {2020},
month = {08}

copy to clipboard

Motivation Recent years have seen a growing number and an expanding scope of studies using synthetic oligo libraries for a range of applications in synthetic biology. As experiments are growing by numbers and complexity, analysis tools can facilitate quality control and support better assessment and inference. Results We present a novel analysis tool, called SOLQC, which enables fast and comprehensive analysis of synthetic oligo libraries, based on NGS analysis performed by the user. SOLQC provides statistical information such as the distribution of variant representation, different error rates and their dependence on sequence or library properties. SOLQC produces graphical reports from the analysis, in a flexible format. We demonstrate SOLQC by analyzing literature libraries. We also discuss the potential benefits and relevance of the different components of the analysis.

Conferences

Bes Dollma, Ohad Elishco, Prof. Eitan Yaakobi, "Coding for Ordered Composite DNA Sequences", International Symposium on Information Theory (ISIT2025)

Dganit Hanania, Prof. Eitan Yaakobi, "Error-Correcting Codes for Labeled DNA Sequences", International Symposium on Information Theory (ISIT2025).

abstract

Labeling of DNA molecules is a fundamental technique for DNA visualization and analysis. This process was mathematically modeled in [1], where the received sequence indicates the positions of the used labels. In this work, we develop error correcting codes for labeled DNA sequences, establishing bounds and constructing explicit systematic encoders for single substitution, insertion, and deletion errors. We focus on two cases: (1) using the complete set of length-two labels and (2) using the minimal set of length-two labels that ensures the recovery of DNA sequences from their labeling for ‘almost’ all DNA sequences.

Roni Con, Ryan Gabrys, Prof. Eitan Yaakobi, "One Code Fits All: Strong Stuck-at Codes for Versatile Memory Encoding", International Symposium on Information Theory (ISIT2024)

abstract

In this work we consider a generalization of the well-studied problem of coding for “stuck-at” errors, which we refer to as “strong stuck-at” codes. In the traditional framework of stuck-at codes, the task involves encoding a message into a one-dimensional binary vector. However, a certain number of the bits in this vector are `frozen’, meaning they are fixed at a predetermined value and cannot be altered by the encoder. The decoder, aware of the proportion of frozen bits but not their specific positions, is responsible for deciphering the intended message. We consider a more challenging version of this problem where the decoder does not even know the fraction of frozen bits. We construct explicit and efficient encoding and decoding algorithms that get arbitrarily close to capacity in this scenario. Furthermore, to the best of our knowledge, our construction is the first fully explicit construction of stuck-at codes that approaches capacity.

Shubhransh Singhvi, Roni Con, Han-Mao Kiah, Prof. Eitan Yaakobi, "An Optimal Sequence Reconstruction Algorithm for Reed-Solomon Codes", International Symposium on Information Theory (ISIT2024)

abstract

The sequence reconstruction problem, introduced by Levenshtein in 2001, considers a scenario where the sender transmits a codeword from some codebook, and the receiver obtains N noisy outputs of the codeword. We study the problem of efficient reconstruction using N outputs that are corrupted by substitutions. Specifically, for the ubiquitous Reed-Solomon codes, we adapt the Koetter-Vardy soft-decoding algorithm, pre- senting a reconstruction algorithm capable of correcting beyond Johnson radius. Furthermore, the algorithm uses O(n N) field operations, where n is the codeword length.

Hadas Abraham, Rayn Gabrys, Prof. Eitan Yaakobi, "Covering All Bases: The Next Inning in DNA Sequencing Efficiency", International Symposium on Information Theory (ISIT2024)

abstract

DNA emerges as a promising medium for the exponential growth of digital data due to its density and durability. This study extends recent research by addressing the coverage depth problem in practical scenarios, exploring optimal error-correcting code pairings with DNA storage systems to minimize coverage depth. Conducted within random access settings, the study provides theoretical analyses and experimental simulations to examine the expectation and probability distribution of samples needed for files recovery. Structured into sections covering definitions, analyses, lower bounds, and comparative evaluations of coding schemes, the paper unveils insights into effective coding schemes for optimizing DNA storage systems.

Avital Boruchovsky, Daniella Bar-Lev, Prof. Eitan Yaakobi, "DNA-Correcting Codes: End-to-end Correction in DNA Storage Systems", International Symposium on Information Theory (ISIT2023)

Constrained Codes

Journal Papers

Dganit Hanania, Daniella Bar-Lev, Yevgeni Nogin, Yoav Shechtman, Prof. Eitan Yaakobi, "On the Capacity of DNA Labeling", IEEE Transactions on Information Theory.

abstractBibTeX

BibTeX

@ARTICLE{10910086,
author={Hanania, Dganit and Bar-Lev, Daniella and Nogin, Yevgeni and Shechtman, Yoav and Yaakobi, Eitan},
journal={IEEE Transactions on Information Theory},
title={On the Capacity of DNA Labeling},
year={2025},
volume={71},
number={5},
pages={3457-3472},
keywords={Labeling;DNA;Europe;Sequential analysis;Visualization;Proteins;Microbiology;Information rates;Genomics;Fluorescence;DNA labeling;channel capacity;constrained systems},
doi={10.1109/TIT.2025.3545662}}

copy to clipboard

DNA labeling is a powerful tool in molecular biology and biotechnology that allows for the visualization, detection, and study of DNA at the molecular level. Under this paradigm, a DNA molecule is being labeled by specific k patterns and is then imaged. Then, the resulting image is modeled as a (k+1) -ary sequence in which any non-zero symbol indicates on the appearance of the corresponding label in the DNA molecule. The primary goal of this work is to study the labeling capacity, which is defined as the maximal information rate that can be obtained using this labeling process. The labeling capacity is computed for almost any pattern of a single label and several results for multiple labels are provided as well. Moreover, we provide the optimal minimal number of labels of length one or two, over any alphabet of size q, that are needed in order to achieve the maximum labeling capacity of log2(q) . Lastly, we discuss the maximal labeling capacity that can be achieved using a certain number of labels of length two.

Avital Boruchovsky, Prof. Tuvi Etzion, Ron M. Roth, "On Nearly Perfect Covering Codes", IEEE Transactions on Information Theory.

Daniella Bar-Lev, Itai Or, Omer Sabary, Tuvi Etzion, Prof. Eitan Yaakobi, "Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning", Arxiv, preprint

abstractBibTeX

BibTeX

@article{bar2021deep,
title={Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory
and Deep Learning},
author={Bar-Lev, Daniella and Orr, Itai and Sabary, Omer and Etzion, Tuvi
and Yaakobi, Eitan},
journal={arXiv preprint arXiv:2109.00031},
year={2021}
}

copy to clipboard

The concept of DNA storage was first suggested in 1959 by Richard Feynman who shared his vision regarding nanotechnology in the talk “There is plenty of room at the bottom”. Later, towards the end of the 20-th century, the interest in storage solutions based on DNA molecules was increased as a result of the human genome project which in turn led to a significant progress in sequencing and assembly methods. DNA storage enjoys major advantages over the well-established magnetic and optical storage solutions. As opposed to magnetic solutions, DNA storage does not require electrical supply to maintain data integrity and is superior to other storage solutions in both density and durability. Given the trends in cost decreases of DNA synthesis and sequencing, it is now acknowledged that within the next 10-15 years DNA storage may become a highly competitive archiving technology and probably later the main such technology. With that said, the current implementations of DNA based storage systems are very limited and are not fully optimized to address the unique pattern of errors which characterize the synthesis and sequencing processes. In this work, we propose a robust, efficient and scalable solution to implement DNA-based storage systems. Our method deploys Deep Neural Networks (DNN) which reconstruct a sequence of letters based on imperfect cluster of copies generated by the synthesis and sequencing processes. A tailor-made Error-Correcting Code (ECC) is utilized to combat patterns of errors which occur during this process. Since our reconstruction method is adapted to imperfect clusters, our method overcomes the time bottleneck of the noisy DNA copies clustering process by allowing the use of a rapid and scalable pseudo-clustering instead. Our architecture combines between convolutions and transformers blocks and is trained using synthetic data modelled after real data statistics.

Conferences

Avital Boruchovsky, Prof. Tuvi Etzion, Prof. Eitan Yaakobi, "Pair-Covering Codes", International Symposium on Information Theory (ISIT2024)

Daniella Bar-Lev, Adir Kobovich, Orian Leitersdorf, Prof. Eitan Yaakobi, "Optimal Almost-Balanced Sequences", International Symposium on Information Theory (ISIT2024)

abstractBibTeX

BibTeX

@INPROCEEDINGS{bar2024balance,
author={Bar-Lev, Daniella and Kobovich, Adir and Leitersdorf, Orian and Yaakobi, Eitan},
booktitle={2024 IEEE International Symposium on Information Theory (ISIT)},
title={Optimal Almost-Balanced Sequences},
year={2024},
pages={2628-2633},
doi={10.1109/ISIT57864.2024.10619424}}

copy to clipboard

This paper presents a novel approach to address the constrained coding challenge of generating almost-balanced sequences. While strictly balanced sequences have been well studied in the past, the problem of designing efficient algorithms with small redundancy, preferably constant or even a single bit, for almost balanced sequences has remained unsolved. A sequence is ε(n)-almost balanced if its Hamming weight is between 0.5n±ε(n). It is known that for any algorithm with a constant number of bits, ε(n) has to be in the order of Θ(√n), with O(n) average time complexity. However, prior solutions with a single redundancy bit required ε(n) to be a linear shift from n/2. Employing an iterative method and arithmetic coding, our emphasis lies in constructing almost balanced codes with a single redundancy bit. Notably, our method surpasses previous approaches by achieving the optimal balanced order of Θ(√n). Additionally, we extend our method to the non-binary case considering q-ary almost polarity-balanced sequences for even q, and almost symbol-balanced for q=4. Our work marks the first asymptotically optimal solutions for almost-balanced sequences, for both, binary and non-binary alphabet.

Omer Yerushalmi, Prof. Tuvi Etzion, Prof. Eitan Yaakobi, "The Capacity of the Weighted Read Channel", International Symposium on Information Theory (ISIT2024)

Adir Kobovich, Orian Leitersdorf, Daniella Bar-Lev, Prof. Eitan Yaakobi, "Universal Framework for Parametric Constrained Coding", International Symposium on Information Theory (ISIT2024)

abstractBibTeX

BibTeX

@inproceedings{kobovich2024universal,
title={Universal Framework for Parametric Constrained Coding},
author={Kobovich, Adir and Leitersdorf, Orian and Bar-Lev, Daniella and Yaakobi, Eitan},
booktitle={2024 IEEE International Symposium on Information Theory (ISIT)},
pages={1023--1028},
year={2024},
organization={IEEE},
doi={10.1109/ISIT57864.2024.10619700}}

copy to clipboard

Constrained coding is a fundamental field in coding theory that tackles efficient communication through constrained channels. While fixed constraints (e.g., a fixed set of substrings may not appear in transmitted messages) have a general optimal solution, there is increasing demand for supporting parametric constraints that are dependent on the message length and portray some property that the substrings must satisfy (e.g., no log (n) consecutive zeros). Several works have tackled such parametric constraints through iterative algorithms following the sequence-replacement approach, yet this approach requires complex constraint-specific properties to guarantee convergence through monotonic progression. In this paper, we propose a universal framework for tackling any parametric constraint problem with far fewer requirements, through a simple iterative algorithm. By reducing an execution of this iterative algorithm to an acyclic graph traversal, we prove a surprising result that guarantees convergence with efficient average time complexity even without requiring any monotonic progression. We demonstrate how to apply this algorithm to the run-length-limited, minimal Hamming weight, local almost-balanced Hamming weight constraints, as well as repeat-free and secondary-structure constraints. Overall, this framework enables state-of-the-art results with minimal effort.

Dganit Hanania, Daniella Bar-Lev, Yevgeni Nogin, Yoav Shechtman, Prof. Eitan Yaakobi, "On the Capacity of DNA Labeling", International Symposium on Information Theory (ISIT2023)

abstractBibTeX

BibTeX

@INPROCEEDINGS{10206769,
author={Hanania, Dganit and Bar-Lev, Daniella and Nogin, Yevgeni and Shechtman, Yoav and Yaakobi, Eitan},
booktitle={2023 IEEE International Symposium on Information Theory (ISIT)},
title={On the Capacity of DNA Labeling},
year={2023},
volume={},
number={},
pages={567-572},
keywords={Visualization;Biotechnology;Biological system modeling;DNA;Symbols;Genetic communication;Labeling},
doi={10.1109/ISIT54713.2023.10206769}}

copy to clipboard

DNA labeling is a powerful tool in molecular biology and biotechnology that allows for the visualization, detection, and study of DNA at the molecular level. Under this paradigm, a DNA molecule is being labeled by specific k patterns and is then imaged. Then, the resulted image is modeled as a (k +1)-ary sequence in which any non-zero symbol indicates on the appearance of the corresponding label in the DNA molecule. The primary goal of this work is to study the labeling capacity, which is defined as the maximal information rate that can be obtained using this labeling process. The labeling capacity is computed for any single label and several results are provided for multiple labels as well. Moreover, we provide the optimal minimal number of labels of length one or two that are needed in order to gain labeling capacity of 2.

Adir Kobovich, Orian Leitersdorf, Daniella Bar-Lev, Prof. Eitan Yaakobi, "Codes for Constrained Periodicity", International Symposium on Information Theory and Its Applications (ISITA2022), awarded

abstractBibTeX

BibTeX

@INPROCEEDINGS{kobovich2022periodicity,
author={Kobovich, Adir and Leitersdorf, Orian and Bar-Lev, Daniella and Yaakobi, Eitan},
booktitle={2022 International Symposium on Information Theory and Its Applications (ISITA)},
title={Codes for Constrained Periodicity},
year={2022},
pages={79-83}}

copy to clipboard

Reliability is an inherent challenge for the emerging nonvolatile technology of racetrack memories, and there exists a fundamental relationship between codes designed for racetrack memories and codes with constrained periodicity. Previous works have sought to construct codes that avoid periodicity in windows, yet have either only provided existence proofs or required high redundancy. This paper provides the first constructions for avoiding periodicity that are both efficient (average-linear time) and with low redundancy (near the lower bound). The proposed algorithms are based on iteratively repairing windows which contain periodicity until all the windows are valid. Intuitively, such algorithms should not converge as there is no monotonic progression; yet, we prove convergence with average-linear time complexity by exploiting subtle properties of the encoder. Overall, we both provide constructions that avoid periodicity in all windows, and we also study the cardinality of such constraints.

Channel Capacity

Journal Papers

Adir Kobovich, Nir Weinberger, "Input Optimization in the Composite DNA Storage Channel", IEEE Journal on Selected Areas in Information Theory.

abstractBibTeX

BibTeX

@ARTICLE{11107232,
author={Kobovich, Adir and Weinberger, Nir},
journal={IEEE Journal on Selected Areas in Information Theory},
title={Input Optimization in the Composite DNA Storage Channel},
year={2025},
volume={6},
number={},
pages={248-260},
doi={10.1109/JSAIT.2025.3595005}}

copy to clipboard

Recent advancements in DNA storage show that composite DNA letters can significantly enhance storage capacity. We model this process as a multinomial channel and propose an optimization algorithm to determine its capacity-achieving input distribution (CAID) for an arbitrary number of output reads. Our empirical results match a scaling law that determines that the support size grows exponentially with capacity. In addition, we introduce a limited-support optimization algorithm that optimizes the input distribution under a restricted support size, making it more feasible for real-world DNA storage systems. We also extend our model to account for noise and study its effect on capacity and input design.

Omer Sabary, Yoav Orlev, Roi Shafir, Leon Anavy, Prof. Eitan Yaakobi, "SOLQC: Synthetic Oligo Library Quality Control Tool", Bioinformatics, Volume 37, Issue 5, 1 March 2021, Pages 720 – 722

abstractBibTeX

BibTeX

@article{10.1093/bioinformatics/btaa740,
author = {Sabary, Omer and Orlev, Yoav and Shafir, Roy and Anavy, Leon
and Yaakobi, Eitan and Yakhini, Zohar},
title = "{SOLQC: Synthetic Oligo Library Quality Control tool}",
journal = {Bioinformatics},
volume = {37},
number = {5},
pages = {720-722},
year = {2020},
month = {08}

copy to clipboard

Motivation Recent years have seen a growing number and an expanding scope of studies using synthetic oligo libraries for a range of applications in synthetic biology. As experiments are growing by numbers and complexity, analysis tools can facilitate quality control and support better assessment and inference. Results We present a novel analysis tool, called SOLQC, which enables fast and comprehensive analysis of synthetic oligo libraries, based on NGS analysis performed by the user. SOLQC provides statistical information such as the distribution of variant representation, different error rates and their dependence on sequence or library properties. SOLQC produces graphical reports from the analysis, in a flexible format. We demonstrate SOLQC by analyzing literature libraries. We also discuss the potential benefits and relevance of the different components of the analysis.

Omer Sabary, Alexander Yucovich, Guy Shapira, Prof. Eitan Yaakobi, "Reconstruction Algorithms for DNA - Storage Systems", bioArxiv, preprint

abstractBibTeX

BibTeX

@article {Sabary2020.09.16.300186,
author = {Sabary, Omer and Yucovich, Alexander and Shapira, Guy and
Yaakobi, Eitan},
title = {Reconstruction Algorithms for DNA
-
Storage Systems},
elocation-id ={2020.09.16.300186},
year = {2020},
doi = {10.1101/2020.09.16.300186},
publisher = {Cold Spring Harbor Laboratory}

copy to clipboard

In the trace reconstruction problem a length-n string x yields a collection of noisy copies, called traces, y1, …, yt where each y i is independently obtained from x by passing through a deletion channel, which deletes every symbol with some fixed probability. The main goal under this paradigm is to determine the required minimum number of i.i.d traces in order to reconstruct x with high probability. The trace reconstruction problem can be extended to the model where each trace is a result of x passing through a deletion-insertion-substitution
channel, which introduces also insertions and substitutions. Motivated by the storage channel of DNA, this work is focused on another variation of the trace reconstruction problem, which is referred by the DNA reconstruction problem. A DNA reconstruction algorithm is a mapping which receives ttraces y1, …, yt as an input and produces, an estimation of x. The goal in the DNA reconstruction problem is to minimize the edit distance between the original string and the algorithm’s estimation. For the deletion channel case, the problem is referred by the deletion DNA reconstruction problem and the goal is to minimize the Levenshtein distance. In this work, we present several new algorithms for these reconstruction problems. Our algorithms look globally on the entire sequence of the traces and use dynamic programming algorithms, which are used for the shortest common super sequence and the longest common subsequence problems, in order to decode the original sequence. Our algorithms do not require any limitations on the input and the number of traces, and more than that, they perform well even for error probabilities as high as 0.27. The algorithms have been tested on simulated data as well as on data from previous DNA experiments and are shown to outperform all previous algorithms.

Conferences

Adir Kobovich, Prof. Eitan Yaakobi, Nir Weinberger, "DeepDIVE: Optimizing Input-Constrained Distributions for Composite DNA Storage via Multinomial Channel", International Symposium on Information Theory (ISIT2025)

abstractBibTeX

BibTeX

@article{kobovich2025deepdive,
title={{DeepDIVE}: Optimizing Input-Constrained Distributions for Composite {DNA} Storage via Multinomial Channel},
author={Kobovich, Adir and Yaakobi, Eitan and Weinberger, Nir},
journal={arXiv preprint arXiv:2501.15172},
year={2025}
}

copy to clipboard

We address the challenge of optimizing the capacity-achieving input distribution for a multinomial channel under the constraint of limited input support size, which is a crucial aspect in the design of DNA storage systems. We propose an algorithm that further elaborates the Multidimensional Dynamic Assignment Blahut-Arimoto (M-DAB) algorithm. Our proposed algorithm integrates variational autoencoder for determining the optimal locations of input distribution, into the alternating optimization of the input distribution locations and weights.

Omer Yerushalmi, Prof. Tuvi Etzion, Prof. Eitan Yaakobi, "The Capacity of the Weighted Read Channel", International Symposium on Information Theory (ISIT2024)

Software and Tools

Adir Kobovich, Nir Weinberger, "Input Optimization in the Composite DNA Storage Channel", Journal on Selected Areas in Information Theory.

abstractBibTeX

BibTeX

@misc{composite_code,
title = {Input Optimization in the Composite DNA Storage Channel},
author = {Adir Kobovich and Nir Weinberger},
journal = {Journal on Selected Areas in Information Theory},
doi = {10.24433/CO.5268435.v1},
howpublished = {\url{https://www.codeocean.com/}},
year = 2025,
month = {6},
version = {v1}
}

copy to clipboard

DNA Sequencing

Conferences

Tomer Cohen, Prof. Eitan Yaakobi, "Optimizing the Decoding Probability and Coverage Ratio of Composite DNA",

International Symposium on Information Theory (ISIT2024)

abstract

This paper studies two problems that are motivated by the novel recent approach of composite DNA that takes advantage of the DNA synthesis property which generates a huge number of copies for every synthesized strand. Under this paradigm, every composite symbols does not store a single nucleotide but a mixture of the four DNA nucleotides. In the first problem, our goal is study how to carefully choose a fixed number of mixtures of the DNA nucleotides such that the decoding probability by the maximum likelihood decoder is maximized. The second problem studies the expected number of strand reads in order to decode a composite strand or a group of composite strands.

Daniella Bar-Lev, Sagi Markovich, Prof. Eitan Yaakobi, Yonatan Yehezkeally, "Adversarial Torn - paper Codes", IEEE International Symposium on Information Theory (ISIT), 2022, pp. 2934 – 2939

abstractBibTeX

BibTeX

@INPROCEEDINGS{9834766,
author={Bar-Lev, Daniella and Marcovich, Sagi and Yaakobi, Eitan and Yehezkeally, Yonatan},
booktitle={2022 IEEE International Symposium on Information Theory (ISIT)},
title={Adversarial Torn
-
paper Codes},
year={2022},
volume={},
number={},
pages={2934-2939},
doi={10.1109/ISIT50566.2022.9834766}}

copy to clipboard

This paper studies the adversarial torn-paper channel. This problem is motivated by applications in DNA data storage where the DNA strands that carry the information may break into smaller pieces that are received out of order. Our model extends the previously researched probabilistic setting to the worst-case. We deve lop code constructions for any parameters of the channel for which non-vanishing asymptotic rate is possible and show that our constructions achieve optimal asymptotic rate while allowing for efficient encoding and decoding. Finally, we extend our results to related settings included multi-strand storage, presence of substitution errors, or incomplete coverage.

Daniella Bar-Lev, Omer Sabary, Ryan Gabrys, Prof. Eitan Yaakobi, "Cover Your Bases: How to Minimize the Sequencing Coverage in DNA Storage Systems", IEEE International Symposium on Information Theory (ISIT) 2023.

abstractBibTeX

BibTeX

@INPROCEEDINGS{10206882,
author={Bar-Lev, Daniella and Sabary, Omer and Gabrys, Ryan and Yaakobi, Eitan},
booktitle={2023 IEEE International Symposium on Information Theory (ISIT)},
title={Cover Your Bases: How to Minimize the Sequencing Coverage in DNA Storage Systems},
year={2023},
volume={},
number={},
pages={370-375},
keywords={Sequential analysis;Costs;DNA;Genetic communication;Error correction codes},
doi={10.1109/ISIT54713.2023.10206882}}

copy to clipboard

Although the expenses associated with DNA sequencing have been rapidly decreasing, the current cost stands at roughly 1.3K/TB, which is dramatically more expensive than reading from existing archival storage solutions today. In this work, we aim to reduce not only the cost but also the latency of DNA storage by studying the DNA coverage depth problem, which aims to reduce the required number of reads to retrieve information from the storage system. Under this framework, our main goal is to understand how to optimally pair an error-correcting code with a given retrieval algorithm to minimize the sequencing coverage depth, while guaranteeing retrieval of the information with high probability. Additionally, we study the DNA coverage depth problem under the random-access setup.

DNA Synthesis

Conferences

Boaz Moav, Ryan Gabrys, Prof. Eitan Yaakobi, "Complex DNA Synthesis Sequences", International Symposium on Information Theory (ISIT2025).

Software Tools

Journal Papers

Omer Sabary, Yoav Orlev, Roi Shafir, Leon Anavy, Prof. Eitan Yaakobi, "SOLQC: Synthetic Oligo Library Quality Control Tool", Bioinformatics, Volume 37, Issue 5, 1 March 2021, Pages 720 – 722

abstractBibTeX

BibTeX

@article{10.1093/bioinformatics/btaa740,
author = {Sabary, Omer and Orlev, Yoav and Shafir, Roy and Anavy, Leon
and Yaakobi, Eitan and Yakhini, Zohar},
title = "{SOLQC: Synthetic Oligo Library Quality Control tool}",
journal = {Bioinformatics},
volume = {37},
number = {5},
pages = {720-722},
year = {2020},
month = {08}

copy to clipboard

Motivation Recent years have seen a growing number and an expanding scope of studies using synthetic oligo libraries for a range of applications in synthetic biology. As experiments are growing by numbers and complexity, analysis tools can facilitate quality control and support better assessment and inference. Results We present a novel analysis tool, called SOLQC, which enables fast and comprehensive analysis of synthetic oligo libraries, based on NGS analysis performed by the user. SOLQC provides statistical information such as the distribution of variant representation, different error rates and their dependence on sequence or library properties. SOLQC produces graphical reports from the analysis, in a flexible format. We demonstrate SOLQC by analyzing literature libraries. We also discuss the potential benefits and relevance of the different components of the analysis.

Optical Mapping

Journal Papers

Dganit Hanania, Daniella Bar-Lev, Yevgeni Nogin, Yoav Shechtman, Prof. Eitan Yaakobi, "On the Capacity of DNA Labeling", IEEE Transactions on Information Theory.

abstractBibTeX

BibTeX

@ARTICLE{10910086,
author={Hanania, Dganit and Bar-Lev, Daniella and Nogin, Yevgeni and Shechtman, Yoav and Yaakobi, Eitan},
journal={IEEE Transactions on Information Theory},
title={On the Capacity of DNA Labeling},
year={2025},
volume={71},
number={5},
pages={3457-3472},
keywords={Labeling;DNA;Europe;Sequential analysis;Visualization;Proteins;Microbiology;Information rates;Genomics;Fluorescence;DNA labeling;channel capacity;constrained systems},
doi={10.1109/TIT.2025.3545662}}

copy to clipboard

DNA labeling is a powerful tool in molecular biology and biotechnology that allows for the visualization, detection, and study of DNA at the molecular level. Under this paradigm, a DNA molecule is being labeled by specific k patterns and is then imaged. Then, the resulting image is modeled as a (k+1) -ary sequence in which any non-zero symbol indicates on the appearance of the corresponding label in the DNA molecule. The primary goal of this work is to study the labeling capacity, which is defined as the maximal information rate that can be obtained using this labeling process. The labeling capacity is computed for almost any pattern of a single label and several results for multiple labels are provided as well. Moreover, we provide the optimal minimal number of labels of length one or two, over any alphabet of size q, that are needed in order to achieve the maximum labeling capacity of log2(q) . Lastly, we discuss the maximal labeling capacity that can be achieved using a certain number of labels of length two.

Yevgeni Nogin, Daniella Bar-Lev, Dganit Hanania, Tahir Detinis Zur, Yuval Ebenstein, Prof. Eitan Yaakobi, Nir Weinberger, Yoav Shechtman, "Design of optimal labeling patterns for optical genome mapping via information theory", Bioinformatics, Volume 39, Issue 10, October 2023

abstractBibTeX

BibTeX

@article{10.1093/bioinformatics/btad601,
author = {Nogin, Yevgeni and Bar-Lev, Daniella and Hanania, Dganit and Detinis Zur, Tahir and Ebenstein, Yuval and Yaakobi, Eitan and Weinberger, Nir and Shechtman, Yoav},
title = "{Design of optimal labeling patterns for optical genome mapping via information theory}",
journal = {Bioinformatics},
volume = {39},
number = {10},
pages = {btad601},
year = {2023},
month = {09},
abstract = "{Optical genome mapping (OGM) is a technique that extracts partial genomic information from optically imaged and linearized DNA fragments containing fluorescently labeled short sequence patterns. This information can be used for various genomic analyses and applications, such as the detection of structural variations and copy-number variations, epigenomic profiling, and microbial species identification. Currently, the choice of labeled patterns is based on the available biochemical methods and is not necessarily optimized for the application.In this work, we develop a model of OGM based on information theory, which enables the design of optimal labeling patterns for specific applications and target organism genomes. We validated the model through experimental OGM on human DNA and simulations on bacterial DNA. Our model predicts up to 10-fold improved accuracy by optimal choice of labeling patterns, which may guide future development of OGM biochemical labeling methods and significantly improve its accuracy and yield for applications such as epigenomic profiling and cultivation-free pathogen identification in clinical samples.https://github.com/yevgenin/PatternCode}",
issn = {1367-4811},
doi = {10.1093/bioinformatics/btad601},
url = {https://doi.org/10.1093/bioinformatics/btad601},
eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/10/btad601/51972115/btad601.pdf},
}

copy to clipboard

Motivation

Optical genome mapping (OGM) is a technique that extracts partial genomic information from optically imaged and linearized DNA fragments containing fluorescently labeled short sequence patterns. This information can be used for various genomic analyses and applications, such as the detection of structural variations and copy-number variations, epigenomic profiling, and microbial species identification. Currently, the choice of labeled patterns is based on the available biochemical methods and is not necessarily optimized for the application.

Results

In this work, we develop a model of OGM based on information theory, which enables the design of optimal labeling patterns for specific applications and target organism genomes. We validated the model through experimental OGM on human DNA and simulations on bacterial DNA. Our model predicts up to 10-fold improved accuracy by optimal choice of labeling patterns, which may guide future development of OGM biochemical labeling methods and significantly improve its accuracy and yield for applications such as epigenomic profiling and cultivation-free pathogen identification in clinical samples.

Conferences

Dganit Hanania, Prof. Eitan Yaakobi, "Error-Correcting Codes for Labeled DNA Sequences", International Symposium on Information Theory (ISIT2025).

abstract

Labeling of DNA molecules is a fundamental technique for DNA visualization and analysis. This process was mathematically modeled in [1], where the received sequence indicates the positions of the used labels. In this work, we develop error correcting codes for labeled DNA sequences, establishing bounds and constructing explicit systematic encoders for single substitution, insertion, and deletion errors. We focus on two cases: (1) using the complete set of length-two labels and (2) using the minimal set of length-two labels that ensures the recovery of DNA sequences from their labeling for ‘almost’ all DNA sequences.

Christoph Hofmeister, Anina Gruica, Dganit Hanania, Rawad Bitar, Prof. Eitan Yaakobi, "Achieving DNA Labeling Capacity with Minimum Labels through Extremal de Bruijn Subgraphs", International Symposium on Information Theory (ISIT2024)

abstractBibTeX

BibTeX

@misc{hofmeister2024achievingdnalabelingcapacity,
title={Achieving DNA Labeling Capacity with Minimum Labels through Extremal de Bruijn Subgraphs},
author={Christoph Hofmeister and Anina Gruica and Dganit Hanania and Rawad Bitar and Eitan Yaakobi},
year={2024},
eprint={2401.15733},
archivePrefix={arXiv},
primaryClass={cs.IT},
url={https://arxiv.org/abs/2401.15733},
}

copy to clipboard

DNA labeling is a tool in molecular biology and biotechnology to visualize, detect, and study DNA at the molecular level. In this process, a DNA molecule is labeled by a set of specific patterns, referred to as labels, and is then imaged. The resulting image is modeled as an (ℓ+1)-ary sequence, where ℓ is the number of labels, in which any non-zero symbol indicates the appearance of the corresponding label in the DNA molecule. The labeling capacity refers to the maximum information rate that can be achieved by the labeling process for any given set of labels. The main goal of this paper is to study the minimum number of labels of the same length required to achieve the maximum labeling capacity of 2 for DNA sequences or log2q for an arbitrary alphabet of size q. The solution to this problem requires the study of path unique subgraphs of the de Bruijn graph with the largest number of edges and we provide upper and lower bounds on this value.

Dganit Hanania, Daniella Bar-Lev, Yevgeni Nogin, Yoav Shechtman, Prof. Eitan Yaakobi, "On the Capacity of DNA Labeling", International Symposium on Information Theory (ISIT2023)

abstractBibTeX

BibTeX

@INPROCEEDINGS{10206769,
author={Hanania, Dganit and Bar-Lev, Daniella and Nogin, Yevgeni and Shechtman, Yoav and Yaakobi, Eitan},
booktitle={2023 IEEE International Symposium on Information Theory (ISIT)},
title={On the Capacity of DNA Labeling},
year={2023},
volume={},
number={},
pages={567-572},
keywords={Visualization;Biotechnology;Biological system modeling;DNA;Symbols;Genetic communication;Labeling},
doi={10.1109/ISIT54713.2023.10206769}}

copy to clipboard

DNA labeling is a powerful tool in molecular biology and biotechnology that allows for the visualization, detection, and study of DNA at the molecular level. Under this paradigm, a DNA molecule is being labeled by specific k patterns and is then imaged. Then, the resulted image is modeled as a (k +1)-ary sequence in which any non-zero symbol indicates on the appearance of the corresponding label in the DNA molecule. The primary goal of this work is to study the labeling capacity, which is defined as the maximal information rate that can be obtained using this labeling process. The labeling capacity is computed for any single label and several results are provided for multiple labels as well. Moreover, we provide the optimal minimal number of labels of length one or two that are needed in order to gain labeling capacity of 2.

Machine Learning

Journal Papers

Adir Kobovich, Nir Weinberger, "Input Optimization in the Composite DNA Storage Channel", IEEE Journal on Selected Areas in Information Theory.

abstractBibTeX

BibTeX

@ARTICLE{11107232,
author={Kobovich, Adir and Weinberger, Nir},
journal={IEEE Journal on Selected Areas in Information Theory},
title={Input Optimization in the Composite DNA Storage Channel},
year={2025},
volume={6},
number={},
pages={248-260},
doi={10.1109/JSAIT.2025.3595005}}

copy to clipboard

Recent advancements in DNA storage show that composite DNA letters can significantly enhance storage capacity. We model this process as a multinomial channel and propose an optimization algorithm to determine its capacity-achieving input distribution (CAID) for an arbitrary number of output reads. Our empirical results match a scaling law that determines that the support size grows exponentially with capacity. In addition, we introduce a limited-support optimization algorithm that optimizes the input distribution under a restricted support size, making it more feasible for real-world DNA storage systems. We also extend our model to account for noise and study its effect on capacity and input design.

Daniella Bar-Lev, Itai Or, Omer Sabary, Tuvi Etzion, Prof. Eitan Yaakobi, "Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning", Arxiv, preprint

abstractBibTeX

BibTeX

@article{bar2021deep,
title={Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory
and Deep Learning},
author={Bar-Lev, Daniella and Orr, Itai and Sabary, Omer and Etzion, Tuvi
and Yaakobi, Eitan},
journal={arXiv preprint arXiv:2109.00031},
year={2021}
}

copy to clipboard

The concept of DNA storage was first suggested in 1959 by Richard Feynman who shared his vision regarding nanotechnology in the talk “There is plenty of room at the bottom”. Later, towards the end of the 20-th century, the interest in storage solutions based on DNA molecules was increased as a result of the human genome project which in turn led to a significant progress in sequencing and assembly methods. DNA storage enjoys major advantages over the well-established magnetic and optical storage solutions. As opposed to magnetic solutions, DNA storage does not require electrical supply to maintain data integrity and is superior to other storage solutions in both density and durability. Given the trends in cost decreases of DNA synthesis and sequencing, it is now acknowledged that within the next 10-15 years DNA storage may become a highly competitive archiving technology and probably later the main such technology. With that said, the current implementations of DNA based storage systems are very limited and are not fully optimized to address the unique pattern of errors which characterize the synthesis and sequencing processes. In this work, we propose a robust, efficient and scalable solution to implement DNA-based storage systems. Our method deploys Deep Neural Networks (DNN) which reconstruct a sequence of letters based on imperfect cluster of copies generated by the synthesis and sequencing processes. A tailor-made Error-Correcting Code (ECC) is utilized to combat patterns of errors which occur during this process. Since our reconstruction method is adapted to imperfect clusters, our method overcomes the time bottleneck of the noisy DNA copies clustering process by allowing the use of a rapid and scalable pseudo-clustering instead. Our architecture combines between convolutions and transformers blocks and is trained using synthetic data modelled after real data statistics.

Conferences

Adir Kobovich, Prof. Eitan Yaakobi, Nir Weinberger, "DeepDIVE: Optimizing Input-Constrained Distributions for Composite DNA Storage via Multinomial Channel", International Symposium on Information Theory (ISIT2025)

abstractBibTeX

BibTeX

@article{kobovich2025deepdive,
title={{DeepDIVE}: Optimizing Input-Constrained Distributions for Composite {DNA} Storage via Multinomial Channel},
author={Kobovich, Adir and Yaakobi, Eitan and Weinberger, Nir},
journal={arXiv preprint arXiv:2501.15172},
year={2025}
}

copy to clipboard

We address the challenge of optimizing the capacity-achieving input distribution for a multinomial channel under the constraint of limited input support size, which is a crucial aspect in the design of DNA storage systems. We propose an algorithm that further elaborates the Multidimensional Dynamic Assignment Blahut-Arimoto (M-DAB) algorithm. Our proposed algorithm integrates variational autoencoder for determining the optimal locations of input distribution, into the alternating optimization of the input distribution locations and weights.

Software and Tools

Adir Kobovich, Nir Weinberger, "Input Optimization in the Composite DNA Storage Channel", Journal on Selected Areas in Information Theory.

abstractBibTeX

BibTeX

@misc{composite_code,
title = {Input Optimization in the Composite DNA Storage Channel},
author = {Adir Kobovich and Nir Weinberger},
journal = {Journal on Selected Areas in Information Theory},
doi = {10.24433/CO.5268435.v1},
howpublished = {\url{https://www.codeocean.com/}},
year = 2025,
month = {6},
version = {v1}
}

copy to clipboard

Information Theory

Journal Papers

Yevgeni Nogin, Daniella Bar-Lev, Dganit Hanania, Tahir Detinis Zur, Yuval Ebenstein, Prof. Eitan Yaakobi, Nir Weinberger, Yoav Shechtman, "Design of optimal labeling patterns for optical genome mapping via information theory", Bioinformatics, Volume 39, Issue 10, October 2023

abstractBibTeX

BibTeX

@article{10.1093/bioinformatics/btad601,
author = {Nogin, Yevgeni and Bar-Lev, Daniella and Hanania, Dganit and Detinis Zur, Tahir and Ebenstein, Yuval and Yaakobi, Eitan and Weinberger, Nir and Shechtman, Yoav},
title = "{Design of optimal labeling patterns for optical genome mapping via information theory}",
journal = {Bioinformatics},
volume = {39},
number = {10},
pages = {btad601},
year = {2023},
month = {09},
abstract = "{Optical genome mapping (OGM) is a technique that extracts partial genomic information from optically imaged and linearized DNA fragments containing fluorescently labeled short sequence patterns. This information can be used for various genomic analyses and applications, such as the detection of structural variations and copy-number variations, epigenomic profiling, and microbial species identification. Currently, the choice of labeled patterns is based on the available biochemical methods and is not necessarily optimized for the application.In this work, we develop a model of OGM based on information theory, which enables the design of optimal labeling patterns for specific applications and target organism genomes. We validated the model through experimental OGM on human DNA and simulations on bacterial DNA. Our model predicts up to 10-fold improved accuracy by optimal choice of labeling patterns, which may guide future development of OGM biochemical labeling methods and significantly improve its accuracy and yield for applications such as epigenomic profiling and cultivation-free pathogen identification in clinical samples.https://github.com/yevgenin/PatternCode}",
issn = {1367-4811},
doi = {10.1093/bioinformatics/btad601},
url = {https://doi.org/10.1093/bioinformatics/btad601},
eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/10/btad601/51972115/btad601.pdf},
}

copy to clipboard

Motivation

Optical genome mapping (OGM) is a technique that extracts partial genomic information from optically imaged and linearized DNA fragments containing fluorescently labeled short sequence patterns. This information can be used for various genomic analyses and applications, such as the detection of structural variations and copy-number variations, epigenomic profiling, and microbial species identification. Currently, the choice of labeled patterns is based on the available biochemical methods and is not necessarily optimized for the application.

Results

In this work, we develop a model of OGM based on information theory, which enables the design of optimal labeling patterns for specific applications and target organism genomes. We validated the model through experimental OGM on human DNA and simulations on bacterial DNA. Our model predicts up to 10-fold improved accuracy by optimal choice of labeling patterns, which may guide future development of OGM biochemical labeling methods and significantly improve its accuracy and yield for applications such as epigenomic profiling and cultivation-free pathogen identification in clinical samples.

Conferences

Adir Kobovich, Prof. Eitan Yaakobi, Nir Weinberger, "M-DAB: An Input-Distribution Optimization Algorithm for Composite DNA Storage by the Multinomial Channel", The International Zurich Seminar on Information and Communication (IZS2024)

abstractBibTeX

BibTeX

@article{kobovich2024MDAB,
title={ {M-DAB}: An Input-Distribution Optimization Algorithm for Composite {DNA} Storage by the Multinomial Channel },
author={ Kobovich, Adir and Yaakobi, Eitan and Weinberger, Nir},
booktitle={ 2024 IZS Proceedings },
pages={82--86},
year={2024}
}

copy to clipboard

Recent experiments have shown that the capacity of DNA storage systems may be significantly increased by synthesizing composite DNA letters. In this work, we model a DNA storage channel with composite inputs as a multinomial channel, and propose an optimization algorithm for its capacity achieving input distribution, for an arbitrary number of output reads. The algorithm is termed multidimensional dynamic assignment Blahut-Arimoto (M-DAB), and is a generalized version of the DAB algorithm, proposed by Wesel et al. [1] developed for the binomial channel. We also empirically observe a scaling law behavior of the capacity as a function of the support size of the capacity-achieving input distribution.

Omer Yerushalmi, Prof. Tuvi Etzion, Prof. Eitan Yaakobi, "The Capacity of the Weighted Read Channel", International Symposium on Information Theory (ISIT2024)

Daniella Bar-Lev, Omer Sabary, Ryan Gabrys, Prof. Eitan Yaakobi, "Cover Your Bases: How to Minimize the Sequencing Coverage in DNA Storage Systems", IEEE International Symposium on Information Theory (ISIT) 2023.

abstractBibTeX

BibTeX

@INPROCEEDINGS{10206882,
author={Bar-Lev, Daniella and Sabary, Omer and Gabrys, Ryan and Yaakobi, Eitan},
booktitle={2023 IEEE International Symposium on Information Theory (ISIT)},
title={Cover Your Bases: How to Minimize the Sequencing Coverage in DNA Storage Systems},
year={2023},
volume={},
number={},
pages={370-375},
keywords={Sequential analysis;Costs;DNA;Genetic communication;Error correction codes},
doi={10.1109/ISIT54713.2023.10206882}}

copy to clipboard

Although the expenses associated with DNA sequencing have been rapidly decreasing, the current cost stands at roughly 1.3K/TB, which is dramatically more expensive than reading from existing archival storage solutions today. In this work, we aim to reduce not only the cost but also the latency of DNA storage by studying the DNA coverage depth problem, which aims to reduce the required number of reads to retrieve information from the storage system. Under this framework, our main goal is to understand how to optimally pair an error-correcting code with a given retrieval algorithm to minimize the sequencing coverage depth, while guaranteeing retrieval of the information with high probability. Additionally, we study the DNA coverage depth problem under the random-access setup.