This paper studies the adversarial torn-paper channel. The problem is motivated by applications in DNA data storage, where the DNA strands that carry the information may break into smaller pieces that are received out of order. Our model extends the previously studied probabilistic setting to the worst case. We develop code constructions for any channel parameters for which a non-vanishing asymptotic rate is possible, and we show that our constructions achieve the optimal asymptotic rate while allowing efficient encoding and decoding. Finally, we extend our results to related settings, including multi-strand storage, the presence of substitution errors, and incomplete coverage.
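To make the channel model concrete, the following is a minimal simulation sketch, assuming a binary strand that an adversary tears at cut points of its choosing, with the pieces delivered as an unordered, shuffled collection; the function name and parameters are illustrative and not taken from the paper.

import random

def torn_paper_channel(strand, cut_points):
    # Tear `strand` at the adversarially chosen `cut_points` and return
    # the resulting segments as an unordered collection (shuffled).
    cuts = sorted(set(cut_points) | {0, len(strand)})
    segments = [strand[a:b] for a, b in zip(cuts, cuts[1:])]
    random.shuffle(segments)  # segments arrive out of order
    return segments

# Example: a length-16 strand torn at three cut points into four pieces.
print(torn_paper_channel("0110100111001011", cut_points=[4, 9, 13]))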
Reliability is an inherent challenge for the emerging nonvolatile technology of racetrack memories, and there exists a fundamental relationship between codes designed for racetrack memories and codes with constrained periodicity. Previous works have sought to construct codes that avoid periodicity in windows, yet have either only provided existence proofs or required high redundancy. This paper provides the first constructions for avoiding periodicity that are both efficient (average-linear time) and of low redundancy (near the lower bound). The proposed algorithms are based on iteratively repairing windows that contain periodicity until all windows are valid. Intuitively, such algorithms should not converge, as there is no monotonic progression; yet we prove convergence with average-linear time complexity by exploiting subtle properties of the encoder. Overall, we provide constructions that avoid periodicity in all windows, and we also study the cardinality of such constraints.
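As an illustration of the constraint itself (not of the paper's encoder), the sketch below checks whether a window contains periodicity, i.e., has some period at most p_max, and locates the first violating window; the repair step and the convergence argument are the paper's contribution and are not reproduced here, and the parameter names are illustrative.

def has_short_period(window, p_max):
    # True if `window` has some period p <= p_max,
    # i.e., window[i] == window[i + p] for all valid i.
    for p in range(1, p_max + 1):
        if all(window[i] == window[i + p] for i in range(len(window) - p)):
            return True
    return False

def first_bad_window(word, ell, p_max):
    # Index of the first length-`ell` window containing periodicity, or None.
    for i in range(len(word) - ell + 1):
        if has_short_period(word[i:i + ell], p_max):
            return i
    return None

print(first_bad_window("1101011010110001", ell=8, p_max=4))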
Although the expenses associated with DNA sequencing have been decreasing rapidly, the current cost stands at roughly $1.3K/TB, which is dramatically more expensive than reading from existing archival storage solutions. In this work, we aim to reduce not only the cost but also the latency of DNA storage by studying the DNA coverage depth problem, which seeks to reduce the number of reads required to retrieve the information from the storage system. Under this framework, our main goal is to understand how to optimally pair an error-correcting code with a given retrieval algorithm so as to minimize the sequencing coverage depth while guaranteeing retrieval of the information with high probability. Additionally, we study the DNA coverage depth problem under the random-access setup.
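As a simple baseline for the coverage depth setting (an assumption for illustration, not the paper's optimal pairing), suppose the information is encoded with an [n, k] MDS code, so that any k distinct coded strands suffice for decoding, and reads are drawn uniformly at random; the sketch below estimates the expected number of reads by Monte Carlo simulation.

import random

def reads_until_decodable(n, k, trials=10000):
    # Monte Carlo estimate of the expected number of uniformly random reads
    # needed to observe k distinct strands out of n (an [n, k] MDS baseline).
    total = 0
    for _ in range(trials):
        seen, reads = set(), 0
        while len(seen) < k:
            seen.add(random.randrange(n))
            reads += 1
        total += reads
    return total / trials

# Example: n = 10 coded strands, any k = 8 of them suffice to decode.
print(reads_until_decodable(n=10, k=8))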
The concept of DNA storage was first suggested in 1959 by Richard Feynman, who shared his vision of nanotechnology in the talk “There's Plenty of Room at the Bottom”. Later, towards the end of the 20th century, interest in storage solutions based on DNA molecules grew as a result of the Human Genome Project, which in turn led to significant progress in sequencing and assembly methods. DNA storage enjoys major advantages over the well-established magnetic and optical storage solutions. As opposed to magnetic solutions, DNA storage does not require an electrical supply to maintain data integrity, and it is superior to other storage solutions in both density and durability. Given the trends of decreasing costs in DNA synthesis and sequencing, it is now acknowledged that within the next 10-15 years DNA storage may become a highly competitive archiving technology, and probably later the leading such technology. With that said, current implementations of DNA-based storage systems are very limited and are not fully optimized to address the unique error patterns that characterize the synthesis and sequencing processes. In this work, we propose a robust, efficient, and scalable solution for implementing DNA-based storage systems. Our method deploys Deep Neural Networks (DNNs), which reconstruct a sequence of letters based on an imperfect cluster of copies generated by the synthesis and sequencing processes. A tailor-made Error-Correcting Code (ECC) is utilized to combat the error patterns that occur during this process. Since our reconstruction method is adapted to imperfect clusters, it overcomes the time bottleneck of clustering the noisy DNA copies by allowing the use of rapid and scalable pseudo-clustering instead. Our architecture combines convolutional and transformer blocks and is trained using synthetic data modeled after real data statistics.
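The abstract does not spell out the architecture, so the following PyTorch-style sketch only illustrates the general idea of combining convolutional and transformer blocks to map a cluster of noisy copies to per-position letter predictions; the input representation, layer sizes, and class name are assumptions, not the paper's model.

import torch
import torch.nn as nn

class ClusterReconstructor(nn.Module):
    # Illustrative sketch: reconstruct a DNA strand from a padded, roughly
    # aligned cluster of noisy copies, one-hot encoded and stacked on the
    # channel axis. Sizes are arbitrary, not the paper's architecture.
    def __init__(self, num_copies=8, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4 * num_copies, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 4)  # logits over {A, C, G, T}

    def forward(self, x):                        # x: (batch, 4 * num_copies, length)
        h = self.conv(x)                         # (batch, d_model, length)
        h = self.transformer(h.transpose(1, 2))  # (batch, length, d_model)
        return self.head(h)                      # (batch, length, 4)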
Motivation
Optical genome mapping (OGM) is a technique that extracts partial genomic information from optically imaged and linearized DNA fragments containing fluorescently labeled short sequence patterns. This information can be used for various genomic analyses and applications, such as the detection of structural variations and copy-number variations, epigenomic profiling, and microbial species identification. Currently, the choice of labeled patterns is based on the available biochemical methods and is not necessarily optimized for the application.
Results
In this work, we develop a model of OGM based on information theory, which enables the design of optimal labeling patterns for specific applications and target organism genomes. We validated the model through experimental OGM on human DNA and through simulations on bacterial DNA. Our model predicts up to 10-fold improved accuracy with an optimal choice of labeling patterns, which may guide the future development of OGM biochemical labeling methods and significantly improve OGM's accuracy and yield for applications such as epigenomic profiling and cultivation-free pathogen identification in clinical samples.
DNA labeling is a powerful tool in molecular biology and biotechnology that allows for the visualization, detection, and study of DNA at the molecular level. Under this paradigm, a DNA molecule is labeled by k specific patterns and is then imaged. The resulting image is modeled as a (k+1)-ary sequence in which any non-zero symbol indicates the appearance of the corresponding label in the DNA molecule. The primary goal of this work is to study the labeling capacity, which is defined as the maximal information rate that can be obtained using this labeling process. The labeling capacity is computed for almost any pattern of a single label, and several results for multiple labels are provided as well. Moreover, we provide the optimal minimal number of labels of length one or two, over any alphabet of size q, that are needed in order to achieve the maximum labeling capacity of log2(q). Lastly, we discuss the maximal labeling capacity that can be achieved using a certain number of labels of length two.
DNA labeling is a powerful tool in molecular biology and biotechnology that allows for the visualization, detection, and study of DNA at the molecular level. Under this paradigm, a DNA molecule is labeled by k specific patterns and is then imaged. The resulting image is modeled as a (k+1)-ary sequence in which any non-zero symbol indicates the appearance of the corresponding label in the DNA molecule. The primary goal of this work is to study the labeling capacity, which is defined as the maximal information rate that can be obtained using this labeling process. The labeling capacity is computed for any single label, and several results are provided for multiple labels as well. Moreover, we provide the optimal minimal number of labels of length one or two that are needed in order to achieve a labeling capacity of 2.
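To illustrate the labeling model used in the two abstracts above, the sketch below maps a DNA string and k label patterns to the (k+1)-ary image sequence, marking the position at which each pattern occurs; the handling of overlapping occurrences and the exact imaging convention are simplified assumptions made for illustration.

def labeling_image(dna, patterns):
    # Map a DNA string and k label patterns to a (k+1)-ary sequence:
    # position j gets symbol i+1 if patterns[i] occurs starting at j, else 0.
    # (Assumes at most one pattern occurs at each position.)
    image = [0] * len(dna)
    for i, pat in enumerate(patterns):
        for j in range(len(dna) - len(pat) + 1):
            if dna[j:j + len(pat)] == pat:
                image[j] = i + 1
    return image

# Example with k = 2 labels over the DNA alphabet.
print(labeling_image("ACGTGCACGT", ["CG", "GCA"]))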
This paper presents a novel approach to the constrained coding challenge of generating almost-balanced sequences. While strictly balanced sequences have been well studied in the past, the problem of designing efficient algorithms with small redundancy, preferably constant or even a single bit, for almost-balanced sequences has remained unsolved. A sequence is ε(n)-almost balanced if its Hamming weight is between 0.5n ± ε(n). It is known that for any algorithm with a constant number of redundancy bits, ε(n) has to be on the order of Θ(√n), with O(n) average time complexity. However, prior solutions with a single redundancy bit required ε(n) to be linear in n. Employing an iterative method and arithmetic coding, we focus on constructing almost-balanced codes with a single redundancy bit. Notably, our method surpasses previous approaches by achieving the optimal order of ε(n) = Θ(√n). Additionally, we extend our method to the non-binary case, considering q-ary almost polarity-balanced sequences for even q and almost symbol-balanced sequences for q = 4. Our work provides the first asymptotically optimal solutions for almost-balanced sequences, for both binary and non-binary alphabets.
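For concreteness, the following sketch checks the ε(n)-almost-balance condition with ε(n) = c·√n, the order stated above to be optimal for a constant number of redundancy bits; the constant c and the function name are illustrative, and the encoder itself (the iterative method with arithmetic coding) is not reproduced here.

import math

def is_almost_balanced(word, c=1.0):
    # Check the eps(n)-almost-balance condition with eps(n) = c * sqrt(n):
    # the Hamming weight must lie in [n/2 - eps, n/2 + eps].
    n = len(word)
    eps = c * math.sqrt(n)
    weight = sum(word)  # word is an iterable of 0/1 bits
    return abs(weight - n / 2) <= eps

print(is_almost_balanced([1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]))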
Constrained coding is a fundamental field in coding theory that tackles efficient communication through constrained channels. While fixed constraints (e.g., a fixed set of substrings may not appear in transmitted messages) have a general optimal solution, there is increasing demand for supporting parametric constraints that depend on the message length and describe some property that the substrings must satisfy (e.g., no log(n) consecutive zeros). Several works have tackled such parametric constraints through iterative algorithms following the sequence-replacement approach, yet this approach requires complex constraint-specific properties to guarantee convergence through monotonic progression. In this paper, we propose a universal framework for tackling any parametric constraint problem with far fewer requirements, via a simple iterative algorithm. By reducing an execution of this iterative algorithm to an acyclic graph traversal, we prove a surprising result that guarantees convergence with efficient average time complexity, even without requiring any monotonic progression. We demonstrate how to apply this algorithm to the run-length-limited, minimal Hamming weight, and local almost-balanced Hamming weight constraints, as well as to repeat-free and secondary-structure constraints. Overall, this framework enables state-of-the-art results with minimal effort.
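The sketch below shows only the general shape of such an iterative algorithm for one example constraint (no run of ceil(log2(n)) consecutive zeros): repeatedly find a violating window and repair it. The repair rule used here is purely illustrative and lossy, not the paper's redundancy-efficient encoder, whose point is precisely that convergence of such loops can be guaranteed without monotonic progression.

import math

def iterative_repair(word, find_violation, repair, max_iters=10**6):
    # Generic loop: repeatedly locate a violating window and repair it
    # until the word satisfies the constraint.
    for _ in range(max_iters):
        pos = find_violation(word)
        if pos is None:
            return word
        word = repair(word, pos)
    raise RuntimeError("did not converge within max_iters")

# Example constraint: no run of ceil(log2(n)) consecutive zeros.
def find_zero_run(word):
    k = math.ceil(math.log2(len(word)))
    idx = word.find("0" * k)
    return None if idx == -1 else idx

# Illustrative (information-destroying) repair: flip the middle of the run.
def flip_middle(word, pos):
    k = math.ceil(math.log2(len(word)))
    j = pos + k // 2
    return word[:j] + "1" + word[j + 1:]

print(iterative_repair("0000000010100000", find_zero_run, flip_middle))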


