This paper studies the adversarial torn-paper channel. The problem is motivated by applications in DNA data storage, where the DNA strands that carry the information may break into smaller pieces that are received out of order. Our model extends the previously studied probabilistic setting to the worst case. We develop code constructions for any channel parameters for which a non-vanishing asymptotic rate is possible, and we show that our constructions achieve the optimal asymptotic rate while allowing efficient encoding and decoding. Finally, we extend our results to related settings, including multi-strand storage, the presence of substitution errors, and incomplete coverage.
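To make the channel model concrete, the following is a minimal simulation sketch, assuming a binary strand that an adversary tears at cut points of its choosing, with the pieces delivered as an unordered, shuffled collection; the function name and parameters are illustrative and not taken from the paper.

import random

def torn_paper_channel(strand, cut_points):
    # Tear `strand` at the adversarially chosen `cut_points` and return
    # the resulting segments as an unordered collection (shuffled).
    cuts = sorted(set(cut_points) | {0, len(strand)})
    segments = [strand[a:b] for a, b in zip(cuts, cuts[1:])]
    random.shuffle(segments)  # segments arrive out of order
    return segments

# Example: a length-16 strand torn at three cut points into four pieces.
print(torn_paper_channel("0110100111001011", cut_points=[4, 9, 13]))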
Reliability is an inherent challenge for the emerging nonvolatile technology of racetrack memories, and there exists a fundamental relationship between codes designed for racetrack memories and codes with constrained periodicity. Previous works have sought to construct codes that avoid periodicity in windows, yet have either only provided existence proofs or required high redundancy. This paper provides the first constructions for avoiding periodicity that are both efficient (average-linear time) and of low redundancy (near the lower bound). The proposed algorithms are based on iteratively repairing windows that contain periodicity until all windows are valid. Intuitively, such algorithms should not converge, as there is no monotonic progression; yet we prove convergence with average-linear time complexity by exploiting subtle properties of the encoder. Overall, we provide constructions that avoid periodicity in all windows, and we also study the cardinality of such constraints.
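As an illustration of the constraint itself (not of the paper's encoder), the sketch below checks whether a window contains periodicity, i.e., has some period at most p_max, and locates the first violating window; the repair step and the convergence argument are the paper's contribution and are not reproduced here, and the parameter names are illustrative.

def has_short_period(window, p_max):
    # True if `window` has some period p <= p_max,
    # i.e., window[i] == window[i + p] for all valid i.
    for p in range(1, p_max + 1):
        if all(window[i] == window[i + p] for i in range(len(window) - p)):
            return True
    return False

def first_bad_window(word, ell, p_max):
    # Index of the first length-`ell` window containing periodicity, or None.
    for i in range(len(word) - ell + 1):
        if has_short_period(word[i:i + ell], p_max):
            return i
    return None

print(first_bad_window("1101011010110001", ell=8, p_max=4))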
Although the expenses associated with DNA sequencing have been decreasing rapidly, the current cost stands at roughly $1.3K/TB, which is dramatically more expensive than reading from existing archival storage solutions. In this work, we aim to reduce not only the cost but also the latency of DNA storage by studying the DNA coverage depth problem, which seeks to reduce the number of reads required to retrieve the information from the storage system. Under this framework, our main goal is to understand how to optimally pair an error-correcting code with a given retrieval algorithm so as to minimize the sequencing coverage depth while guaranteeing retrieval of the information with high probability. Additionally, we study the DNA coverage depth problem under the random-access setup.
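As a simple baseline for the coverage depth setting (an assumption for illustration, not the paper's optimal pairing), suppose the information is encoded with an [n, k] MDS code, so that any k distinct coded strands suffice for decoding, and reads are drawn uniformly at random; the sketch below estimates the expected number of reads by Monte Carlo simulation.

import random

def reads_until_decodable(n, k, trials=10000):
    # Monte Carlo estimate of the expected number of uniformly random reads
    # needed to observe k distinct strands out of n (an [n, k] MDS baseline).
    total = 0
    for _ in range(trials):
        seen, reads = set(), 0
        while len(seen) < k:
            seen.add(random.randrange(n))
            reads += 1
        total += reads
    return total / trials

# Example: n = 10 coded strands, any k = 8 of them suffice to decode.
print(reads_until_decodable(n=10, k=8))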
The concept of DNA storage was first suggested in 1959 by Richard Feynman, who shared his vision of nanotechnology in the talk “There's Plenty of Room at the Bottom”. Later, towards the end of the 20th century, interest in storage solutions based on DNA molecules grew as a result of the Human Genome Project, which in turn led to significant progress in sequencing and assembly methods. DNA storage enjoys major advantages over the well-established magnetic and optical storage solutions. As opposed to magnetic solutions, DNA storage does not require an electrical supply to maintain data integrity, and it is superior to other storage solutions in both density and durability. Given the trends of decreasing costs in DNA synthesis and sequencing, it is now acknowledged that within the next 10-15 years DNA storage may become a highly competitive archiving technology, and probably later the leading such technology. With that said, current implementations of DNA-based storage systems are very limited and are not fully optimized to address the unique error patterns that characterize the synthesis and sequencing processes. In this work, we propose a robust, efficient, and scalable solution for implementing DNA-based storage systems. Our method deploys Deep Neural Networks (DNNs), which reconstruct a sequence of letters based on an imperfect cluster of copies generated by the synthesis and sequencing processes. A tailor-made Error-Correcting Code (ECC) is utilized to combat the error patterns that occur during this process. Since our reconstruction method is adapted to imperfect clusters, it overcomes the time bottleneck of clustering the noisy DNA copies by allowing the use of rapid and scalable pseudo-clustering instead. Our architecture combines convolutional and transformer blocks and is trained using synthetic data modeled after real data statistics.
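The abstract does not spell out the architecture, so the following PyTorch-style sketch only illustrates the general idea of combining convolutional and transformer blocks to map a cluster of noisy copies to per-position letter predictions; the input representation, layer sizes, and class name are assumptions, not the paper's model.

import torch
import torch.nn as nn

class ClusterReconstructor(nn.Module):
    # Illustrative sketch: reconstruct a DNA strand from a padded, roughly
    # aligned cluster of noisy copies, one-hot encoded and stacked on the
    # channel axis. Sizes are arbitrary, not the paper's architecture.
    def __init__(self, num_copies=8, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4 * num_copies, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 4)  # logits over {A, C, G, T}

    def forward(self, x):                        # x: (batch, 4 * num_copies, length)
        h = self.conv(x)                         # (batch, d_model, length)
        h = self.transformer(h.transpose(1, 2))  # (batch, length, d_model)
        return self.head(h)                      # (batch, length, 4)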
Motivation
Optical genome mapping (OGM) is a technique that extracts partial genomic information from optically imaged and linearized DNA fragments containing fluorescently labeled short sequence patterns. This information can be used for various genomic analyses and applications, such as the detection of structural variations and copy-number variations, epigenomic profiling, and microbial species identification. Currently, the choice of labeled patterns is based on the available biochemical methods and is not necessarily optimized for the application.
Results
In this work, we develop a model of OGM based on information theory, which enables the design of optimal labeling patterns for specific applications and target organism genomes. We validated the model through experimental OGM on human DNA and through simulations on bacterial DNA. Our model predicts up to 10-fold improved accuracy with an optimal choice of labeling patterns, which may guide the future development of OGM biochemical labeling methods and significantly improve OGM's accuracy and yield for applications such as epigenomic profiling and cultivation-free pathogen identification in clinical samples.
DNA labeling is a powerful tool in molecular biology and biotechnology that allows for the visualization, detection, and study of DNA at the molecular level. Under this paradigm, a DNA molecule is labeled by k specific patterns and is then imaged. The resulting image is modeled as a (k+1)-ary sequence in which any non-zero symbol indicates the appearance of the corresponding label in the DNA molecule. The primary goal of this work is to study the labeling capacity, which is defined as the maximal information rate that can be obtained using this labeling process. The labeling capacity is computed for almost any pattern of a single label, and several results for multiple labels are provided as well. Moreover, we provide the optimal minimal number of labels of length one or two, over any alphabet of size q, that are needed in order to achieve the maximum labeling capacity of log2(q). Lastly, we discuss the maximal labeling capacity that can be achieved using a certain number of labels of length two.
DNA labeling is a powerful tool in molecular biology and biotechnology that allows for the visualization, detection, and study of DNA at the molecular level. Under this paradigm, a DNA molecule is labeled by k specific patterns and is then imaged. The resulting image is modeled as a (k+1)-ary sequence in which any non-zero symbol indicates the appearance of the corresponding label in the DNA molecule. The primary goal of this work is to study the labeling capacity, which is defined as the maximal information rate that can be obtained using this labeling process. The labeling capacity is computed for any single label, and several results are provided for multiple labels as well. Moreover, we provide the optimal minimal number of labels of length one or two that are needed in order to achieve a labeling capacity of 2.
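To illustrate the labeling model used in the two abstracts above, the sketch below maps a DNA string and k label patterns to the (k+1)-ary image sequence, marking the position at which each pattern occurs; the handling of overlapping occurrences and the exact imaging convention are simplified assumptions made for illustration.

def labeling_image(dna, patterns):
    # Map a DNA string and k label patterns to a (k+1)-ary sequence:
    # position j gets symbol i+1 if patterns[i] occurs starting at j, else 0.
    # (Assumes at most one pattern occurs at each position.)
    image = [0] * len(dna)
    for i, pat in enumerate(patterns):
        for j in range(len(dna) - len(pat) + 1):
            if dna[j:j + len(pat)] == pat:
                image[j] = i + 1
    return image

# Example with k = 2 labels over the DNA alphabet.
print(labeling_image("ACGTGCACGT", ["CG", "GCA"]))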
This paper presents a novel approach to the constrained coding challenge of generating almost-balanced sequences. While strictly balanced sequences have been well studied in the past, the problem of designing efficient algorithms with small redundancy, preferably constant or even a single bit, for almost-balanced sequences has remained unsolved. A sequence is ε(n)-almost balanced if its Hamming weight is between 0.5n ± ε(n). It is known that for any algorithm with a constant number of redundancy bits, ε(n) has to be on the order of Θ(√n), with O(n) average time complexity. However, prior solutions with a single redundancy bit required ε(n) to be linear in n. Employing an iterative method and arithmetic coding, we focus on constructing almost-balanced codes with a single redundancy bit. Notably, our method surpasses previous approaches by achieving the optimal order of ε(n) = Θ(√n). Additionally, we extend our method to the non-binary case, considering q-ary almost polarity-balanced sequences for even q and almost symbol-balanced sequences for q = 4. Our work provides the first asymptotically optimal solutions for almost-balanced sequences, for both binary and non-binary alphabets.
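For concreteness, the following sketch checks the ε(n)-almost-balance condition with ε(n) = c·√n, the order stated above to be optimal for a constant number of redundancy bits; the constant c and the function name are illustrative, and the encoder itself (the iterative method with arithmetic coding) is not reproduced here.

import math

def is_almost_balanced(word, c=1.0):
    # Check the eps(n)-almost-balance condition with eps(n) = c * sqrt(n):
    # the Hamming weight must lie in [n/2 - eps, n/2 + eps].
    n = len(word)
    eps = c * math.sqrt(n)
    weight = sum(word)  # word is an iterable of 0/1 bits
    return abs(weight - n / 2) <= eps

print(is_almost_balanced([1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]))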
Constrained coding is a fundamental field in coding theory that tackles efficient communication through constrained channels. While fixed constraints (e.g., a fixed set of substrings may not appear in transmitted messages) have a general optimal solution, there is increasing demand for supporting parametric constraints that depend on the message length and describe some property that the substrings must satisfy (e.g., no log(n) consecutive zeros). Several works have tackled such parametric constraints through iterative algorithms following the sequence-replacement approach, yet this approach requires complex constraint-specific properties to guarantee convergence through monotonic progression. In this paper, we propose a universal framework for tackling any parametric constraint problem with far fewer requirements, via a simple iterative algorithm. By reducing an execution of this iterative algorithm to an acyclic graph traversal, we prove a surprising result that guarantees convergence with efficient average time complexity, even without requiring any monotonic progression. We demonstrate how to apply this algorithm to the run-length-limited, minimal Hamming weight, and local almost-balanced Hamming weight constraints, as well as to repeat-free and secondary-structure constraints. Overall, this framework enables state-of-the-art results with minimal effort.
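The sketch below shows only the general shape of such an iterative algorithm for one example constraint (no run of ceil(log2(n)) consecutive zeros): repeatedly find a violating window and repair it. The repair rule used here is purely illustrative and lossy, not the paper's redundancy-efficient encoder, whose point is precisely that convergence of such loops can be guaranteed without monotonic progression.

import math

def iterative_repair(word, find_violation, repair, max_iters=10**6):
    # Generic loop: repeatedly locate a violating window and repair it
    # until the word satisfies the constraint.
    for _ in range(max_iters):
        pos = find_violation(word)
        if pos is None:
            return word
        word = repair(word, pos)
    raise RuntimeError("did not converge within max_iters")

# Example constraint: no run of ceil(log2(n)) consecutive zeros.
def find_zero_run(word):
    k = math.ceil(math.log2(len(word)))
    idx = word.find("0" * k)
    return None if idx == -1 else idx

# Illustrative (information-destroying) repair: flip the middle of the run.
def flip_middle(word, pos):
    k = math.ceil(math.log2(len(word)))
    j = pos + k // 2
    return word[:j] + "1" + word[j + 1:]

print(iterative_repair("0000000010100000", find_zero_run, flip_middle))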


