A practical DNA data storage using an expanded alphabet introducing 5-methylcytosine

Deruilin Liu
Demin Xu
Liuxin Shi
Jiayuan Zhang
Kewei Bi
Bei Luo
Chen Liu
Yuxiang Li
Guangyi Fan
Wen Wang
Zhi Ping

Curated by GigaByte

Editors Assessment:

DNA has huge potential as a data storage medium because of its incredibly high storage density and stability. This work addresses the potential of modified bases, specifically 5-methylcytosine (5mC), in enhancing DNA data storage systems. The paper introduces a transcoding scheme named R+, which incorporates the modified base 5mC to increase information density beyond the standard limits. By encoding various file types into DNA sequences of between 1.3 and 1.6 kb in size, the method achieves an average recovery rate of 98.97% (with reference), validating its effectiveness. In addition to a wet-lab protocol (hosted on protocols.io) for the experimental validation of the transcoding scheme, the work includes open source code for in-silico simulation tests. Peer review scrutinised the protocols and validation, finding them reusable and the results convincing. As nanopore sequencing has enabled the reading of these modified bases, it is timely to make them applicable as extra letters in the molecular alphabet for DNA data storage.

This evaluation refers to version 1 of the preprint



Abstract

The DNA molecule is a promising next-generation data storage medium. Recently, it has been theoretically proposed that non-natural or modified bases can serve as extra molecular letters to increase the information density. However, this strategy is challenging due to the difficulty in synthesizing non-natural DNA sequences and their complex structure. Here, we describe a practical DNA data storage transcoding scheme named R+ based on an expanded molecular alphabet that introduces 5-methylcytosine (5mC). We demonstrate its experimental validation by encoding one representative file into several 1.3–1.6 kb in vitro DNA fragments for nanopore sequencing. Our results show an average data recovery rate of 98.97% and 86.91% with and without reference, respectively. Our work validates the practicability of 5mC in DNA storage systems, with a potentially wide range of applications. Availability and implementation: R+ is implemented in Python and the code is available under an MIT license at https://github.com/Incpink-Liu/DNA-storage-R_plus.
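To illustrate why an extra molecular letter can raise information density, the following minimal sketch computes the theoretical upper bound on bits per nucleotide for a given alphabet size. It assumes every symbol is equally usable and ignores coding constraints, so it is not the density achieved by R+; the function name `max_bits_per_nt` is hypothetical.

```python
import math

def max_bits_per_nt(alphabet_size: int) -> float:
    """Theoretical upper bound on information per nucleotide (log2 of alphabet size)."""
    return math.log2(alphabet_size)

# Standard alphabet {A, C, G, T} versus an expanded alphabet with 5mC added.
standard = max_bits_per_nt(4)
expanded = max_bits_per_nt(5)

# Relative gain of the expanded alphabet over the standard one.
gain = expanded / standard - 1

print(f"standard: {standard:.2f} bits/nt")
print(f"expanded: {expanded:.2f} bits/nt")
print(f"relative gain: {gain:.1%}")
```

Under these assumptions a five-letter alphabet allows at most about 2.32 bits per nucleotide, compared with 2 bits for the four natural bases.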

GigaByte
Feb 4, 2025


This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.147). These reviews (including a protocol review) are as follows.

Reviewer 1. Abdur Rasool

Is the source code available, and has an appropriate Open Source Initiative license been assigned to the code?

However, the Git links have a typo; the working code is available at https://github.com/Incpink-Liu/DNA-storage-R_plus

Is the code executable?

Unable to test. Complete execution of the given code requires time and resources.

Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?

Unable to test.

Additional Comments:

This manuscript focuses on DNA data storage based on an expanded molecular alphabet. In view of the challenges of non-natural bases in synthesis, sequencing, and compatibility, the manuscript proposes a DNA data storage scheme containing 5-methylcytosine based on the theory that modified bases can replace non-natural bases as extra molecular letters and develops an adaptive transcoding algorithm named R+ for corresponding experimental validation. The high data recovery rate obtained from sequencing analysis demonstrates its practicability.

This manuscript provides a simple but relatively universal transcoding algorithm for DNA data storage that introduces additional molecular letters. The proposed DNA data storage scheme outperforms conventional DNA data storage in the potential development of information density. Considering the anticipated decrease in future synthesis costs and the expected advancements in relevant transcoding algorithms, my outlook remains optimistic regarding the potential application of this scheme. I suggest that the manuscript could be accepted after a few minor revisions listed below:

Figure 3 in the paper could be further modified, specifically minimizing the excess white space on both sides of Subfigure A to make it more aesthetically pleasing.

The subfigures A, B, and D in Figure 2 and Figure S2 both demonstrate the difference between poem.txt/program.py and the other four files. However, the manuscript lacks an explanation for this phenomenon. Is it relevant to the file size?

The 8 nt adaptors play a key role during the sequence assembly in the experimental validation, so I suggest supplementing the specific generation process of these linkers. Text descriptions or flow charts are acceptable.

It would be better to add the in silico simulation to the Methods to make the structure more complete.

Regarding the practicality of DNA storage, I suggest citing https://onlinelibrary.wiley.com/doi/10.1002/smtd.202301585 and https://academic.oup.com/bib/article/25/5/bbae463/7759103.

Provide the correct URLs of GitHub links for reproducibility.

Reviewer 2. Bi Kun

Are there (ideally real world) examples demonstrating use of the software?

No.

Additional Comments:

In this study, a practical DNA data storage transcoding scheme named R+, based on an expanded molecular alphabet, is proposed to increase the information density. The experimental validation demonstrates the practicability of DDS-5mC and highlights the enormous potential of modified bases, represented by 5mC, in the field of DNA data storage. Overall, the methods and results look appropriate and promising, but a few minor issues need to be addressed.

1. Please indicate the proportion of substitution : insertion : deletion in the error rates of Fig. 4C and D.
2. What is the meaning of the vertical axis of Fig. 2B? Is it the number of homopolymers per sequence, the longest length of homopolymers, or something else?
3. Line 304: please add an "s" to give "References".
4. The last sentence of the Abstract reads: "This work validates the practicability of 5mC over other non-natural bases in DNA storage systems". Please make it correspond with the last paragraph of the Results (lines 151-154).
5. Depending on the journal's guidelines, a Conclusion section may be added if necessary.
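The two candidate readings of the Fig. 2B axis raised in point 2 can be told apart with a short sketch. It is only an illustration, not the authors' analysis code: it assumes 5mC is written as the symbol `M` and treated as distinct from `C`, and the function name `homopolymer_runs` and the example sequence are hypothetical.

```python
from itertools import groupby

def homopolymer_runs(seq: str, min_len: int = 2) -> list[int]:
    """Lengths of all runs of identical symbols that are at least min_len long."""
    lengths = (len(list(group)) for _, group in groupby(seq))
    return [n for n in lengths if n >= min_len]

# Hypothetical sequence; 'M' stands in for 5mC (an assumption of this sketch).
seq = "ACCMMMGTTA"
runs = homopolymer_runs(seq)

num_runs = len(runs)             # reading 1: number of homopolymers per sequence
longest = max(runs, default=1)   # reading 2: length of the longest homopolymer

print(num_runs, longest)
```

Whether a `C` followed by `M` should count as one homopolymer is a design choice the sketch does not settle; here the two symbols break the run.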

Reviewer 3. Lifu Song

This manuscript explores the application of 5-methylcytosine (5mC) as an additional molecular letter in DNA data storage systems, expanding the molecular alphabet to increase information density. The authors present a novel transcoding scheme (R+) and validate it with both in silico and experimental data. The study explores GC content, homopolymer distribution, and data recovery rates under various conditions, offering detailed insights into practical applications. Experimental validation with nanopore sequencing demonstrates real-world feasibility. By improving storage density and ensuring compatibility with nanopore sequencing, the study addresses significant challenges in incorporating non-natural bases into DNA storage systems. Overall, the manuscript is well-structured and addresses a highly relevant topic in DNA data storage, offering valuable contributions to the field. I recommend it for publication, subject to minor revisions to enhance clarity and precision.
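As a rough illustration of one of the sequence metrics mentioned above, GC content over an expanded alphabet might be computed as follows. This is a sketch under stated assumptions, not the manuscript's method: 5mC is written as `M` and counted together with `C`, and the function name `gc_content` and the example sequence are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of symbols that are G, C, or M (M is assumed to denote 5mC)."""
    if not seq:
        return 0.0
    return sum(base in "GCM" for base in seq.upper()) / len(seq)

# Hypothetical 10-nt sequence containing two 5mC symbols.
example = "ACGMTTGCAM"
gc = gc_content(example)
print(f"GC content: {gc:.2f}")
```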

Suggested minor revisions:

Although substitution errors, particularly between C and 5mC, were discussed, the manuscript does not provide a detailed explanation of how these errors affect long-term storage or large-scale applications, both of which are critical for archival storage, the primary use case of DNA data storage technology.

The manuscript could benefit from a broader comparison with other high-density DNA storage strategies, such as composite DNA letters, to contextualize the benefits and limitations of 5mC.

The discussion could be expanded to address practical challenges, such as strategies to reduce synthesis costs and improve sequencing accuracy for modified bases like 5mC, to provide a more holistic perspective on the technology's scalability.

Protocol Review: I have taken a look at the experiment protocol associated with this manuscript in the website of protocols.io. The protocol looks sensible. I don't have any additional comments about it and am happy for it to go live.

See: https://dx.doi.org/10.17504/protocols.io.q26g7mr78gwz/v1
Version published to 10.46471/gigabyte.147
Jan 24, 2025
Version published to 10.1101/2024.12.26.630439 on bioRxiv
Dec 26, 2024
