DDS-E-Sim: A Transformer-Based Generative Framework for Simulating Error-Prone Sequences in DNA Data Storage
Abstract
The rapidly growing volume of data demands reliable, long-lasting storage solutions. DNA has emerged as a promising medium because of its high information density and long-term stability. However, DNA storage is a complex process in which each stage introduces noise and errors, including synthesis errors, storage decay, and sequencing errors, so error-correcting codes (ECCs) are required for reliable data recovery. Designing an optimal data recovery method requires a comprehensive understanding of the noise structure in a DNA data storage channel. Because running DNA data storage experiments in vitro is still expensive and time-consuming, a simulation model is needed that can mimic the error patterns in real data and stand in for the experiments. Existing simulation tools often rely on fixed error probabilities or are specific to certain technologies. In this study, we present DDS-E-Sim, a transformer-based generative framework for simulating errors in a DNA data storage channel. Our simulator takes oligos (the DNA sequences to be written) as input and generates erroneous output DNA reads that closely resemble the real-life output of common DNA data storage pipelines. It captures both random and biased error patterns, such as k-mer and transition errors, regardless of the process or technology. We demonstrate the effectiveness of our simulator by analyzing two datasets processed with distinct technologies. For the first dataset, processed with Illumina MiSeq, sequences simulated by DDS-E-Sim exhibit a total error rate deviation of only 0.1% from the original dataset. For the second, processed with Oxford Nanopore Technologies, the deviation is 0.7%. Both base-level and k-mer errors closely align with the original dataset. Additionally, our simulator generates 100,743 unique oligos from 35,329 sequences, with each sequence read five times, demonstrating its ability to simulate biased errors and stochastic properties simultaneously.
Our simulator outperforms existing simulators with superior accuracy and the ability to handle diverse sequencing technologies.
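For contrast, the kind of fixed-error-probability simulation that the abstract attributes to existing tools can be sketched in a few lines of Python. This is only an illustrative baseline, not the DDS-E-Sim model: the function names (`simulate_read`, `simulate_reads`), the rate parameters (`p_sub`, `p_ins`, `p_del`), and their default values are all assumptions made for this example and are not taken from the paper.

```python
import random

BASES = "ACGT"

def simulate_read(oligo, p_sub=0.01, p_ins=0.005, p_del=0.005, rng=None):
    """Pass one oligo through a channel with fixed, position-independent
    substitution, insertion, and deletion probabilities.
    (Hypothetical baseline; rates are illustrative, not from the paper.)"""
    if rng is None:
        rng = random.Random()
    read = []
    for base in oligo:
        r = rng.random()
        if r < p_del:
            continue  # deletion: the base is dropped from the read
        if r < p_del + p_sub:
            # substitution: replace with one of the three other bases
            read.append(rng.choice([b for b in BASES if b != base]))
        else:
            read.append(base)  # base copied without error
        if rng.random() < p_ins:
            read.append(rng.choice(BASES))  # insertion after this position
    return "".join(read)

def simulate_reads(oligos, copies=5, seed=0, **rates):
    """Generate `copies` noisy reads per oligo (five by default, matching
    the five reads per sequence mentioned in the abstract)."""
    rng = random.Random(seed)
    return [[simulate_read(o, rng=rng, **rates) for _ in range(copies)]
            for o in oligos]

if __name__ == "__main__":
    for reads in simulate_reads(["ACGTACGTAC", "TTGACCGTAA"]):
        print(reads)
```

Because every base is corrupted with the same probability regardless of its context, a baseline like this cannot reproduce biased patterns such as k-mer-dependent or transition-specific errors, which is the gap a learned generative model is meant to address.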